LNCS 2008 



B. Falsafi 

T. N. Vijaykumar (Eds.) 



Power-Aware 
Computer Systems 

First International Workshop, PACS 2000 
Cambridge, MA, USA, November 2000 
Revised Papers 



Lecture Notes in Computer Science 

Edited by G. Goos, J. Hartmanis and J. van Leeuwen 



2008 



Springer 

Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Singapore 

Tokyo 



B. Falsafi T. N. Vijaykumar (Eds.) 



Power-Aware 
Computer Systems 



First International Workshop, PACS 2000 
Cambridge, MA, USA, November 12, 2000 
Revised Papers 




Springer 



Series Editors 



Gerhard Goos, Karlsruhe University, Germany 
Juris Hartmanis, Cornell University, NY, USA 
Jan van Leeuwen, Utrecht University, The Netherlands 

Volume Editors 
Babak Falsafi 

Carnegie Mellon University 

Department of Electrical and Computer Engineering 
5000 Forbes Ave. 

Pittsburgh, PA 15213, USA 
E-mail: babak@cmu.edu 

T. N. Vijaykumar 
Purdue University 

School of Electrical and Computer Engineering 
1285 EE Building 
W. Lafayette, IN 47907, USA 
E-mail: vijay@ecn.purdue.edu 

Cataloging-in-Publication Data applied for 
Die Deutsche Bibliothek - CIP-Einheitsaufnahme 

Power aware computer systems : first international workshop ; revised papers 
/ PACS 2000, Cambridge, MA, USA, November 12, 2000. B. Falsafi ; T. N. 
Vijaykumar (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; 
London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2001 
(Lecture notes in computer science ; Vol. 2008) 

ISBN 3-540-42329-X 



CR Subject Classification (1998): B, C, D.1, D.3, F.3 
ISSN 0302-9743 

ISBN 3-540-42329-X Springer-Verlag Berlin Heidelberg New York 



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are 
liable for prosecution under the German Copyright Law. 

Springer-Verlag Berlin Heidelberg New York 

a member of BertelsmannSpringer Science+Business Media GmbH 

http://www.springer.de 

© Springer-Verlag Berlin Heidelberg 2001 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna 
Printed on acid-free paper SPIN 10782036 06/3142 5 4 3 2 1 0 



Preface 



The phenomenal increases in computer system performance in recent years have 
been accompanied by a commensurate increase in power and energy dissipation. 
The latter has directly resulted in demand for expensive packaging and cooling 
technology, an increase in product cost, and a decrease in product reliability 
in all segments of the computing market. Moreover, the higher power/energy 
dissipation has significantly reduced battery life in portable systems. While sy- 
stem designers have traditionally relied on circuit-level techniques to reduce po- 
wer/energy, there is a growing need to address power/energy dissipation at all 
levels of the computer system. 

We are pleased to welcome you to the proceedings of the Power-Aware Computer 
Systems (PACS2000) workshop. PACS2000 was the first workshop in its 
series and its aim was to bring together experts from academia and industry to 
address power-/energy-awareness at all levels of computer systems. In these pro- 
ceedings, we bring you several excellent research contributions spanning a wide 
spectrum of areas in power-aware systems, from application all the way to com- 
pilers and microarchitecture, and to power/performance estimating models and 
tools. We have grouped the contributions into the following specific categories: 
(1) power-aware microarchitectural/circuit techniques, (2) application/compiler 
power optimizations, (3) exploiting opportunity for power optimization in in- 
struction scheduling and cache memories, and (4) power/performance models 
and tools. 

The first and third group of papers primarily address the opportunity for 
power optimization at the architectural/microarchitectural level. While there 
are large variabilities in hardware resource utilization within and across appli- 
cations, high-performance processor cores are designed for worst-case resource 
demands and therefore often waste power/energy while idling. The papers in 
these groups propose techniques to take advantage of the variability in resource 
demand. These include resizing the instruction issue queue, dynamically varying the 
instruction issue width, supply-gating inactive cache block frames to reduce leakage 
power/energy dissipation, mechanisms to trigger dynamic voltage scaling, 
and predicting functional unit activation time to reduce inductive noise during 
the on/off transition. 

The second group of papers focus on application transformations and compi- 
ler optimization techniques to reduce energy/power. The papers propose energy- 
efficient dynamic memory allocation schemes for an MPEG multimedia player, 
power-aware graphics rendering algorithms and hardware to exploit content va- 
riation and human visual perception, and compiler optimization to estimate 
memory-intensive program phases that can trigger dynamic voltage scaling to 
slow down the processor clock. 

The last group of papers present power/performance estimation models for 
high-end microprocessors. The first paper presents a tool from Intel that integrates 
both analytical models and empirical data for power estimation with SimpleScalar, 
a cycle-accurate performance estimator. The second paper describes 
two simulation models from IBM to evaluate power/performance characteristics 
of PowerPC processors. The models trade off accuracy for speed. The third pa- 
per compares and contrasts the effectiveness and merits of two existing tools for 
power/performance estimation. 

PACS 2000 was a highly successful forum, thanks to a number of high-quality 
submissions, the enormous efforts of the program committee, the keynote spea- 
ker, and the attendees. We would like to thank Shekhar Borkar for preparing the 
keynote speech, pinpointing the technological scaling trends and their impact on 
energy/power consumption in general and the increase in transistor subthreshold 
leakage current in particular. We would like to thank Konrad Lai for the 
excellent keynote delivery on behalf of Shekhar, and for agreeing to substitute for 
Shekhar on short notice, due to an emergency. We would like to thank Larry 
Rudolf, James Hoe, and Boon Ang and the other members of the ASPLOS- 
IX organizing committee who publicized the workshop and helped with local 
accommodation. 

Abstract. This paper focuses on system-level design methods for low energy 
consumption in architectures employing variable-voltage processors. Two low-energy 
design flows are introduced. The first, Speed-up and Stretch, is based on 
the performance vs. low-energy design trade-off. The second, Eye-on-Energy, is 
based on energy sensitive scheduling and assignment techniques. Both of the 
approaches presented in this paper use simulated annealing to generate task-to-processor 
assignments. Also, both use list-scheduling based methods for scheduling. 
The experiments presented here characterize the newly introduced 
approaches, while giving an idea about the cost vs. low-energy and performance 
vs. low-energy design trade-offs a designer has to make. 



Keywords: low energy, system-level design, variable voltage processors 



1 Introduction 

From mobile computing and communication to deep-space applications and medical 
implants, low energy consumption is a main design requirement. Several methods for 
lowering the energy consumption for such systems have already been developed. To 
obtain a highly efficient system, these methods have to be applied throughout 
the design cycle. Since targeting low energy and low power as early as possible in 
the design process is most productive [1], the work presented in this paper focuses on 
system-level design methods. Today, processors supporting selective shut-down and 
power modes are present in every energy efficient system. Although shutting down 
idling parts is one way to further reduce the energy consumption [2,3], it is more effec- 
tive to slow down selected computations and run the processing units at lower supply 
voltages [6]. The most worthwhile configuration to decrease the energy consumption 
involves processors which can adjust their supply voltage at runtime, depending on 
their load [12,13]. The main idea behind our work is to decide beforehand the best sup- 
ply voltages for each processor throughout their functioning (also referred to as “sup- 
ply voltage scheduling”). According to this description, there are certain similarities 
with the problem of behavioral synthesis for multiple supply voltage circuits, an area 
in which much work has been done (e.g. [11]). At behavioral level, once a unique supply 
voltage for a functional unit is determined during design, it will be constant at runtime. 
But the problem of scheduling the supply voltages at system level is different in 
principle. The supply voltage of a dynamic voltage processor can vary at runtime, 
which offers more flexibility and greater potential for reducing the energy. For this reason, 
several researchers have addressed energy related issues as specific problems at 
system level [1-9]. Selecting the system architecture and the distribution of the computations 
can greatly influence the overall energy consumption [5,8,9]. An optimal preemptive 
scheduling algorithm for independent tasks running on a single processor with 
variable speed is described in [4]. In the same context, a low-energy oriented heuristic 
for non-preemptive scheduling is described in [6]. In [7], Ishihara and Yasuura present 
another scheduling technique, employing variable voltage processors and independent 
tasks. There, the authors show that the energy can be further reduced by running each 
task in two phases, at two different supply voltages. 

This work was sponsored by ARTES: A network for Real-Time research and graduate Education in Sweden, 
http://www.artes.uu.se/ 

In this paper, we present two system-level design flows, both solving the architec- 
ture selection (referred to as assignment) and task scheduling problems together, to 
find near-optimal low-energy designs. The first approach is a simple and rather ad hoc 
method based on the performance vs. low-power trade-off. The second approach 
involves energy estimates and a specific energy-controlled design flow. The execution 
model of the tasks is similar to the one described in [7]; namely, we also assume that each 
task executes in two stages at different supply voltages. In both design flows we perform 
non-preemptive static scheduling, using list-scheduling with appropriate priority 
functions. Unlike some of the previous work, we handle dependent tasks. We also 
present a more complete view of a system-level design flow for low-energy, rather than 
only scheduling or architecture selection. 

The rest of the paper is organized as follows. Section 2 presents our generic view 
of a system-level design flow. Section 3 briefly presents the theoretical basis for using 
variable voltage processors to lower the energy consumption. Our two low-energy 
design methods are introduced in section 4. Section 5 briefly presents the scheduling 
method we use. Section 6 describes our low-energy assignment strategy. Section 7 
contains experimental results. Finally, section 8 presents our conclusions. 

2 Design Flow Overview 




Fig. 1. Generic System Design for Low Energy 



In this section, we briefly present our 
view of the system-level design flow 
(Fig. 1), as a basis for our low-energy 
design flow. The input data are a task- 
graph, the available resources, and an 
execution deadline. The task-graph 
describes the system functionality in 
terms of dependent tasks and often comes 
from a preceding partitioning step. The 
number and type of available processors 
is often given by the designer. We assume 
that the processors are variable supply 
voltage processors. By altering the 
resource pool, the designer can explore the design space searching for good solutions. 
The execution deadline is usually given as a requirement. The design process, as 
described here, is time-and-resource constrained. We adopt a step-by-step design flow, 
performing the assignment of tasks to processors followed by scheduling the tasks on 
their processors. Both of these two steps are NP-hard problems in the general case, and 
have to be handled by heuristics. We use simulated annealing to find a good assign- 
ment and a constructive heuristic to perform scheduling. We describe two specific 
approaches for energy reduction in the later sections. By the end of the design process 
the following data is available: the exact architecture of the system, the timing behav- 
ior of each task, and, finally, which processor will execute which task. The designer 
can easily modify the pool of available resources and re-iterate the whole process, to 
explore other design choices. 

3 The Role of Supply Voltage in Energy Consumption 

Consider that task T is executing during N clock cycles on a processor P, which runs at 
supply voltage V and clock frequency f. For the given voltage V, processor P has an 
average power consumption $\pi$. The energy consumed by executing task T on processor 
P, running at supply voltage V, is computed as: $E = (N \cdot \pi)/f$. The average power 
consumption dependency on the supply voltage and execution frequency is given by: 
$\pi = K_a \cdot f \cdot V^2$, where $K_a$ is a task/processor dependent factor determined by the 
switched capacitance. Combining the above two formulae, we can rewrite the energy 
as: $E = N \cdot E_{f,V} = N \cdot K_a \cdot V^2$, where $E_{f,V}$ is the average cycle energy. From this, we 
can conclude that lowering the supply voltage would yield a drastic decrease in energy 
consumption. At the same time, the supply voltage affects the circuit delay, which sets 
the clock frequency, since $f \sim 1/delay$. Formally, $delay = K_b \cdot V/(V - V_T)^2$, where V 
is the supply voltage, $V_T$ is the threshold voltage, and $K_b$ is a constant. Thus, decreasing 
the voltage leads to lower clock frequencies and thus to longer execution time. 
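
The two relations above can be tried out numerically with a short Python sketch. The function names and the constant values are illustrative assumptions, as is taking the clock period to be exactly the circuit delay (f = 1/delay, where the text only states proportionality).

```python
def cycle_energy(k_a, v):
    """Average energy per clock cycle: E_{f,V} = K_a * V^2."""
    return k_a * v ** 2


def circuit_delay(k_b, v, v_t):
    """Circuit delay: K_b * V / (V - V_T)^2, defined only for V > V_T."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k_b * v / (v - v_t) ** 2


def task_energy_and_time(n_cycles, k_a, k_b, v, v_t):
    """Energy and execution time of a task of n_cycles at supply voltage v.

    Assumption (not from the paper): the clock period equals the circuit delay.
    """
    energy = n_cycles * cycle_energy(k_a, v)
    time = n_cycles * circuit_delay(k_b, v, v_t)
    return energy, time
```

Halving the supply voltage in this model divides the task energy by four, while the execution time grows because the circuit delay increases.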

Processors supporting several supply voltages are already available on the market. 
The various voltages yield different execution delays and energy consumption for the 
same task T. Fig. 2 depicts an illustrative example of scheduling a task-graph (a). In a 
classic scheduling technique, such as list-scheduling, voltage selection is not consid- 
ered (b). In order to obtain energy savings, one must choose the best supply voltage on 
each processor for each task (c). In practice this can not be done, since the number of 
available supply voltages for a processor is limited. In (d) a feasible schedule is repre- 
sented, using processors with two voltages. 

In the case of real processors, with a limited number of supply voltages, the mini- 
mal energy for executing a task is obtained by using only two different supply volt- 
ages, as shown in [7]. These voltages are the ones around the “ideal” supply voltage for 
the given deadline. The “ideal” supply voltage is the unique voltage which sets the exe- 
cution time for that task exactly to the deadline. Only processors with continuous 
range of supply voltages can achieve the “ideal” voltage. On the other hand, the pro- 
cessors with a discrete range of supply voltages can use only those available. When the 
allowed task execution time t is between $t_1$ (obtained for $V_1$) and $t_2$ (obtained for $V_2$), 
the task will execute partly at $V_1$ and partly at $V_2$. The execution time t can be 
expressed depending on the number of processor cycles run at the two supply voltages, 
for the resulting clock frequencies $f_1$ and $f_2$: $t = x/f_1 + (N - x)/f_2$, where x is the number 
of cycles run at supply voltage $V_1$. Moreover, task energy can be expressed as a 
function of clock cycles: $E(x) = x \cdot E_{f_1,V_1} + (N - x) \cdot E_{f_2,V_2}$. From the last two equations 
we can deduce the expression of energy as a function of time, E(t), which is a linear 
dependency. An example of what this dependency may look like for a processor with 
three supply voltages is given in Fig. 3. A certain execution duration for a task 
uniquely identifies the number of cycles needed to execute at the different voltages. 

Fig. 2. Voltage scheduling for low energy on multiple voltage processors. The area under the power profile reflects the energy consumption. Panels: a) the task-graph to be scheduled (already assigned); c) ideal case: P1 and P2 can run at any supply voltage; d) real case: only two supply voltages available. 

The model described above (depicted in Fig. 2.d) is the execution model 
assumed in our approach. A task runs in two phases, at two different supply 
voltages, chosen from the available voltages. Whenever the processor is 
idle, it is shut down. In this case, the ideal energy function is approximated 
by segments between the neighboring voltages (see Fig. 3). An 
increased number of available voltages yields a better approximation of the ideal 
energy function, leading to a more efficient design. 

Fig. 3. Task energy relative to its execution time on a three voltage processor. (Horizontal axis: task execution delay, from minimal to maximal delay; vertical axis: energy, from minimal to maximal energy; curves: ideal E(t) and real E(t).) 
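
The two-phase execution model can be sketched in Python as follows. Given the allowed execution time, the sketch picks the two neighboring operating points and solves t = x/f_fast + (N - x)/f_slow for x. The function name, the representation of an operating point as a (clock frequency, energy per cycle) pair, and the requirement that all frequencies are distinct are assumptions made for illustration.

```python
def split_cycles(n_cycles, t_allowed, points):
    """Split a task's cycles between the two operating points around the ideal speed.

    points: iterable of (clock_frequency, energy_per_cycle), one per supply
    voltage, with distinct frequencies.
    Returns (x_fast, x_slow, energy): cycles run at the faster point, cycles
    run at the slower point, and the resulting task energy.
    """
    pts = sorted(points)  # ascending clock frequency
    if t_allowed < n_cycles / pts[-1][0]:
        raise ValueError("task cannot finish in time even at the highest voltage")
    if t_allowed >= n_cycles / pts[0][0]:
        # Enough time to run entirely at the lowest voltage; the processor is
        # assumed to be shut down during the remaining idle time.
        return 0.0, float(n_cycles), n_cycles * pts[0][1]
    for (f_slow, e_slow), (f_fast, e_fast) in zip(pts, pts[1:]):
        if n_cycles / f_fast <= t_allowed <= n_cycles / f_slow:
            # Solve t = x/f_fast + (N - x)/f_slow for x.
            x = (t_allowed - n_cycles / f_slow) / (1.0 / f_fast - 1.0 / f_slow)
            return x, n_cycles - x, x * e_fast + (n_cycles - x) * e_slow
    raise AssertionError("unreachable when all frequencies are distinct")
```

Between two neighboring operating points the returned energy is a linear function of the allowed time, which is the segment-wise approximation of the ideal energy function described above.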



4 Design Alternatives for Low Energy 

To get an energy efficient design, all the steps in the design flow should get feedback 
from the subsequent steps. If these steps are performed independently, only local 
optima can be found. For this reason, in the assignment step we use a heuristic search 
guided by a cost function, estimating the final outcome of the synthesis. Next, we 
introduce two methods for low-energy system design, based on the design flow intro- 
duced in Fig. 1. 

4.1 The Speed-up and Stretch Approach (S&S) 

The first method is what we consider one of the simplest, although not trivial, 
approaches to low-energy system design (Fig. 4). It is based on the idea that one can 
trade execution speed for low power/energy. In principle the task-graph is assigned in 
such a way that the schedule is as short as possible. This assignment is performed using 
simulated annealing, as described later on. Then, this tightest schedule is stretched, by 
lowering the supply voltage selectively for each processor, such that it covers the entire 
time interval up to the desired deadline. Scheduling is performed using list-scheduling 
with a priority function based on critical path. Thus, the method does not require any 
information about the energy consumed by the processors. It uses the assignment and 
scheduling techniques from a classic, non-low-energy oriented approach. 

Next we present the theoretical basis for the idea of scaling the tightest schedule as 
a low-energy schedule. Consider first the ideal case when the tasks are independent, 
running on one processor. The processor is also ideal, in the sense that it can run at any 
supply voltage. Also the threshold voltage is always small enough to be negligible or 
can be proportionally adjusted at runtime [12]. We start from the energy expression of task i 
for the reference voltage ($V_0$), and for the voltage after scaling ($V_i$): 

$E_{0i} = N_i \cdot K_a \cdot V_0^2$, $E_i = N_i \cdot K_a \cdot V_i^2$ (1) 

For a sufficiently small threshold voltage, the clock frequencies for the two voltages 
are (section 3): $f_0 = K_b \cdot V_0$, $f_i = K_b \cdot V_i$ (2) 

From (1), (2): $E_i = E_{0i} \cdot (f_i/f_0)^2$ (3) 

From section 3: $E_{0i} = (N_i \cdot \pi_0)/f_0$ (4) 

From (3), (4): $E_i = N_i \cdot \pi_0 \cdot f_i^2/f_0^3$ (5) 

If $\delta_i$ is the task execution delay: $f_0 = N_i/\delta_{0i}$, $f_i = N_i/\delta_i$ (6) 

From (5), (6): $E_i = \pi_0 \cdot \delta_{0i}^3/\delta_i^2$ (7) 



The total energy of the task graph is the summation of the individual task energies: 
$E = \sum_i E_i = \pi_0 \cdot \sum_i \delta_{0i}^3/\delta_i^2$, where we know that the tasks execute during the whole time 
until the deadline, $\sum_i \delta_i = d$, and the tightest possible deadline is $\sum_i \delta_{0i} = d_0$. 
The lower bound for the total energy is $\pi_0 \cdot d_0^3/d^2$, provable by mathematical 
induction. This lower bound can be obtained only when $\delta_i = \delta_{0i} \cdot d/d_0$, thus by scaling 
the tightest schedule to fit the new deadline. The supply voltage associated with the 
new schedule can be easily computed using (7). Note that the processor will run at the 
same, single supply voltage for all tasks. This method can be extended to several processors 
executing independent tasks. 
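
A small numerical check of this result, written as a Python sketch under the same idealized assumptions (one processor, independent tasks, negligible threshold voltage). The function names and the sample delays are hypothetical.

```python
def total_energy(pi_0, tight_delays, delays):
    """Sum of the task energies E_i = pi_0 * delta_0i^3 / delta_i^2, as in (7)."""
    return sum(pi_0 * d0 ** 3 / d ** 2 for d0, d in zip(tight_delays, delays))


def uniform_stretch(tight_delays, deadline):
    """Scale every task delay by the same factor d / d_0."""
    d_0 = sum(tight_delays)
    return [d0 * deadline / d_0 for d0 in tight_delays]


def energy_lower_bound(pi_0, tight_delays, deadline):
    """Lower bound pi_0 * d_0^3 / d^2 for the total energy."""
    return pi_0 * sum(tight_delays) ** 3 / deadline ** 2
```

For tight delays 1, 2, 3 and a deadline of 12, uniform stretching gives delays 2, 4, 6 and a total energy equal to the bound, while an uneven split of the same deadline, such as 3, 4, 5, consumes more.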

If we consider partially ordered tasks or task-graphs, we have to use a scheduling 
strategy which can handle dependencies. Here, we consider list-scheduling with criti- 
cal path as priority function. The task-graph is scheduled using list-scheduling, obtain- 
ing the tightest possible deadline. The schedule is then scaled to fit the desired 
deadline. The scaling factor is the ratio between the desired deadline and the deadline 
obtained by list-scheduling, as for the independent tasks case, presented earlier. This is 
a fast scheduling method and can be used directly in the assignment generation step as 
a cost function. 
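
The schedule-then-scale step can be sketched in Python as below. This is a simplified illustration rather than the authors' implementation: it ignores communication delays, assumes the task-graph is non-empty and acyclic, and uses hypothetical names throughout.

```python
def critical_path(succ, delay):
    """Length of the longest path starting at each task (the task included)."""
    memo = {}

    def longest(task):
        if task not in memo:
            tail = max((longest(s) for s in succ.get(task, ())), default=0.0)
            memo[task] = delay[task] + tail
        return memo[task]

    for task in delay:
        longest(task)
    return memo


def list_schedule(succ, delay, assign):
    """Non-preemptive list-scheduling with critical path as priority.

    succ: {task: [successor tasks]}, delay: {task: execution time at full
    speed}, assign: {task: processor}. Returns (start_times, schedule_length).
    """
    prio = critical_path(succ, delay)
    pending = {task: 0 for task in delay}      # unscheduled predecessors
    for successors in succ.values():
        for s in successors:
            pending[s] += 1
    ready_at = {task: 0.0 for task in delay}   # all predecessors finished
    proc_free = {}                             # processor -> time it becomes free
    start = {}
    ready = [task for task in delay if pending[task] == 0]
    while ready:
        ready.sort(key=lambda task: prio[task], reverse=True)
        task = ready.pop(0)                    # longest critical path first
        proc = assign[task]
        start[task] = max(ready_at[task], proc_free.get(proc, 0.0))
        finish = start[task] + delay[task]
        proc_free[proc] = finish
        for s in succ.get(task, ()):
            ready_at[s] = max(ready_at[s], finish)
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    return start, max(start[t] + delay[t] for t in start)


def stretch(start, delay, length, deadline):
    """Scale start times and delays by deadline / length."""
    k = deadline / length
    return {task: (start[task] * k, delay[task] * k) for task in start}
```

Because the schedule length returned by `list_schedule` is the only value needed to derive the scaling factor, it can serve directly as the cost of an assignment.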

4.2 The Eye-on-Energy Approach (EonE) 

One of the problems with the previous approach is that the scheduling strategy works 
well only for the tasks on the critical path. These tasks are the only ones making the best 
use of the available resources, as in the ideal case. An improved scheduling strategy would be 
to take advantage of the low processor load for the non-critical path. This means stretching 
the tasks differently, depending on the available processor time. We have developed 
such a scheduling strategy (Scaled Low-Energy Scheduling - ScaledLEneS) 
using a list-scheduling algorithm with an improved priority function [10]. For the same 
assignment, our ScaledLEneS algorithm can save up to 25% energy compared to the 
old list-scheduling approach with stretch. Details are presented in section 5. 

Although our new scheduling algorithm saves energy compared to the simpler list-scheduling 
and stretch approach (S&S), it takes longer to perform scheduling. For 
example, scheduling a task-graph with 56 tasks on eight processors takes around 5 
minutes. This is far too long to include in a simulated annealing loop, as in the previous 
approach. To overcome this problem, we built a function which estimates the 
energy consumption when using our scheduling strategy. The estimator is tuned in a 
separate phase, prior to the heuristic search. The estimator is as fast as the classic 
list-scheduling, used as cost function in the S&S approach. Details about the energy 
estimation function and the tuning step are given in sub-section 6.1. 

In this context, we developed an improved design flow (Fig. 5), which targets the 
energy consumption more directly than the previous one. At first, the energy estimator 
is tuned for the current task-graph by fitting its estimates to the energy values obtained 
after a complete (and time-consuming for that matter) scheduling with ScaledLEneS. 
Then we perform a heuristic search using simulated annealing (SA), as in the S&S 
approach, except this time the search is directed by the estimated energy, as opposed to 
schedule length. Finally, with the best assignment found by SA, the design is scheduled 
using ScaledLEneS. 

Fig. 4. The Speed-up & Stretch Approach. The figure details the dashed line box from Fig. 1. 

Fig. 5. The Eye-on-Energy Approach. 

5 Low Energy Scheduling 

The scaled low-energy scheduling algorithm (ScaledLEneS) used in our design flow 
works in two basic steps. First we schedule the already assigned task-graph using list- 
scheduling with a specially designed priority function. The schedule obtained in this 
step is as tight as possible, as in a typical list-scheduling with a critical path based pri- 
ority function. The tasks on the critical path will execute at their maximal speed, using 
the highest supply voltage for their assigned processor. The tasks which are not on the 
critical path, will be executed at a lower supply voltage, and thus at a lower speed and 
with a lower energy consumption. In the next scheduling step, this tight schedule is 
stretched to fit the required deadline, thus reducing the energy even more, as intro- 
duced in sub-section 4.1. Here, we focus on the first step of ScaledLEneS, called 
LEneS, since the schedule stretching is performed exactly as in the S&S approach. 
Next we describe the features which make LEneS special: the input data and the prior- 
ity function used in list-scheduling. 

Since the tasks can have variable length, each task is modeled by two connected 
nodes, representing the start and the end of the execution of that task. The task-graph 
(TG) is thus translated during a preprocessing step into an Enhanced Task-Graph 
(ETG). An example of a translation is given in Fig. 6. The TG describes the system 
functionality in terms of dependent tasks. Nodes model the tasks, while directed arcs 
model the dependencies. Once a task (node) finishes its execution, the dependent tasks 
can start. The ETG is obtained from the TG by substituting each node with a pair of 
nodes. A start node (circle) marks the beginning of the execution of a task and an end 
node (grey disk) marks the completion of a task. The numbers in Fig. 6 denote execution 
times. The execution times of the tasks, assigned to nodes in the TG, are assigned 
to the internal edges in the ETG. In our current model, only computational tasks 
can change their execution delay, while the communication delays remain fixed. 
The thick edges in the ETG represent the fixed delays. The other edges depict modifi- 
able delays, and the associated numbers describe their minimal values. 

During the actual scheduling step, every start and end node is scheduled separately. 
This enhancement allows us to modify the actual execution time of the tasks, and thus 
indirectly control the supply voltages for the assigned processor. 
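
The TG-to-ETG translation can be sketched in Python as follows. The edge representation (a tuple carrying a delay and a "fixed" flag) and all names are assumptions made for illustration.

```python
def to_etg(exec_time, comm_delay):
    """Translate a task-graph into an enhanced task-graph.

    exec_time: {task: minimal execution time}
    comm_delay: {(source_task, destination_task): communication delay}
    Returns a list of edges (from_node, to_node, delay, fixed), where a node
    is a (task, "start") or (task, "end") pair.
    """
    edges = []
    for task, t_min in exec_time.items():
        # Internal edge: the delay is modifiable, t_min is its minimal value.
        edges.append(((task, "start"), (task, "end"), t_min, False))
    for (src, dst), delay in comm_delay.items():
        # Dependency edge: communication delays remain fixed.
        edges.append(((src, "end"), (dst, "start"), delay, True))
    return edges
```

Each task contributes one modifiable internal edge, and each dependency contributes one fixed edge from the end node of the producer to the start node of the consumer.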







Fig. 6. A task-graph (a) is transformed into an enhanced task-graph (b) 
Another important aspect of LEneS is the energy sensitive priority function. In 
principle, the priority of a node (start or end) reflects the energy consumption impact 
(denoted as f(i,t)) of scheduling that node i at a certain time moment t, also including 
the influence of this action on the rest of the nodes, which are not scheduled yet. A 
detailed description of how to compute f(i,t) is given in [10]. Since the time constraints 
are not captured by function f, using it as a priority does not always lead to feasible 
schedules for tight deadlines. To be able to find schedules even for tight deadlines, we 
use the following priority function: 

$prio(i, t) = f(i, t) + \alpha_i \cdot \frac{\ldots}{deadline - t - criticalpath(i)}$ 

Criticalpath(i) is the delay of the longest path starting in node i. Each ETG node 
has an associated coefficient $\alpha_i$, which controls the emphasis on lowering energy vs. 
generating a tight schedule. Having a different $\alpha$ for each node allows us to treat the 
nodes on the critical path in a different manner, focusing, for those nodes, more on fulfilling 
the deadline than on lowering the energy. Each $\alpha_i$ is tuned inside the LEneS 
algorithm by starting from small values and increasing some of them gradually, if the 
deadline cannot be fulfilled. With the priority given above, if all $\alpha_i$ are large enough, 
the priority function behaves as a critical-path priority. The set of smallest $\alpha_i$ for a 
graph and a deadline yields the lowest energy consumption for that graph and deadline. 

6 Low Energy Assignment 

6.1 The Energy Estimator 

In our design flow, the low-energy scheduling algorithm is necessary for an accurate 
evaluation of a solution. Unfortunately it takes far too long to directly control the 
heuristic search. For this reason we built a fast energy estimator which approximates 
the value obtained by the actual ScaledLEneS scheduling. The function used for esti- 
mation is: 



The quantities entering this function are the following: the maximal energy consumption 
for the given assignment, obtained when all the tasks run as fast as possible; Pr, the 
number of processors used by the assignment; the length of the fastest schedule, 
obtained through list-scheduling with a critical-path priority function; the required 
deadline; and the energy square deviation taken over each processor. Finally, a, b, and c 
are constants, specific for every task-graph. The reason behind choosing such a function is 
the following. The first ratio gives in principle the average maximal energy consumed by 
each processor. The bigger this value is, the larger the final energy consumption gets. 
The second ratio describes the scaling ability of the schedule. If this factor is high, 
meaning that even the tightest schedule needs a long time to complete, it is more likely 
that very little extra time will be left for scaling. The square power comes from the 
time-vs.-energy dependency. The second term in the estimator expression describes how 
well balanced the energy is among the available processors, for the tightest schedule. 
Several of our experiments suggested that a smaller value leads to a lower final energy 
consumption. 

The a, b, and c parameters are tuned by fitting the function to the energy values 
obtained for several random assignments, scheduled with ScaledLEneS. Given the 
function shape, we use non-linear regression to determine a, b, and c. 

The algorithmic complexity required to compute E for a given assignment is given 
by the complexity of determining its parameters. The most costly of all is the 
length of the tightest schedule, which is determined through a classic list-scheduling. 
Thus, the complexity of computing the cost of an assignment is the same as in the S&S 
approach. 

6.2 Searching Using Simulated Annealing 

Finding the best task-to-processor assignment is an NP-hard problem. For this reason 
we use a well known and easy to implement heuristic search method: simulated 
annealing (SA). There are several aspects which define a specific implementation of a 
simulated annealing algorithm. 

The neighborhood is defined in terms of one type of move, which is re-assigning 
one randomly selected task to a different processor. For the SA implemented in the 
S&S approach, a random task is assigned to the processor which yields the fastest exe- 
cution time for that very task. For the SA implemented in the EonE approach, a ran- 
dom task is assigned to the processor which yields the lowest energy consumption for 
that task. To allow the search to exit local minima, whenever the selected task is 
already assigned on the best processor (best local assignment), a random assignment is 
performed for that task. 

The parameters of the SA cooling scheme can be used to speed up the search with 
the risk of obtaining worse solutions, as shown in Fig. 8. The number of iterations at 
one temperature is dynamically computed depending on the problem size and tempera- 
ture. The stopping condition is given by the dynamics of the solution. If the solution 
remains unchanged (within a 5% range) in the last three temperatures, the search stops. 
The SA cost function in the EonE approach is the estimated value of energy. For the 
S&S approach, the cost function is the schedule length. 
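
The search loop described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name anneal, the geometric cooling factor alpha, the start and end temperatures, the fixed number of moves per temperature (the text computes it dynamically), and the callbacks local_cost and total_cost are all hypothetical. The move and the stopping rule follow the description in the text. At least two processors are assumed.

```python
import math
import random


def anneal(n_tasks, n_procs, local_cost, total_cost, rng,
           t_start=10.0, t_end=0.01, alpha=0.9, moves_per_task=10):
    """Simulated annealing over task-to-processor assignments (sketch).

    local_cost(task, proc): cost of one task on one processor (execution
    time for S&S, energy for EonE).  total_cost(assign): cost of a whole
    assignment (schedule length for S&S, estimated energy for EonE).
    """
    # assign[t] is the processor that task t is mapped to.
    assign = [rng.randrange(n_procs) for _ in range(n_tasks)]
    cost = total_cost(assign)
    best, best_cost = assign[:], cost
    history = []  # cost reached at the end of each temperature
    temp = t_start
    while temp > t_end:
        # Assumption: a fixed number of moves per temperature.
        for _ in range(moves_per_task * n_tasks):
            task = rng.randrange(n_tasks)
            # Move: re-assign the task to its locally best processor ...
            target = min(range(n_procs), key=lambda p: local_cost(task, p))
            if target == assign[task]:
                # ... or to a random other processor if it is already there.
                target = rng.choice(
                    [p for p in range(n_procs) if p != assign[task]])
            old = assign[task]
            assign[task] = target
            new_cost = total_cost(assign)
            delta = new_cost - cost
            # Accept improvements always, worsenings with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = assign[:], cost
            else:
                assign[task] = old  # undo the rejected move
        history.append(cost)
        # Stop if the solution stayed within a 5% range over three temperatures.
        last = history[-3:]
        if len(last) == 3 and max(last) - min(last) <= 0.05 * max(last):
            break
        temp *= alpha  # assumption: geometric cooling
    return best, best_cost
```

A faster cooling scheme corresponds to a smaller alpha or fewer moves per temperature in this sketch, which shortens the search at the risk of a worse final assignment.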

7 Experimental Results 

In this section, we present four sets of experiments. The first set shows the accuracy of 
the energy estimator used in the EonE design flow method. The second set presents the 
behavior of our simulated annealing implementation and its tuning abilities. The third 
set compares the two design methods introduced in this paper for several random 
graphs. Finally, the fourth and last set of experiments shows the energy savings capa- 
bilities of our S&S and EonE design methods for a real life example. All experiments 
assume processors able to run at four different supply voltages, between 3.3V and 
0.9V, with correspondingly adjusted threshold voltages, using e.g. the method given in 
[12]. 

The first experiment shows the accuracy of the energy estimator, described in 
sub-section 6.1, built for the EonE approach. For this we used one hundred random 
graphs of thirty nodes and with a dependency depth no larger than five tasks. The 
assignment is done on six processors with different average power consumption. 

The tightest possible deadline with the given resources was obtained using SA with 
the schedule length as cost function. First we fitted the a, b, and c parameters of the 
estimator using 30 different random assignments. The standard deviation (average, 
minimum, and maximum) of the estimates from the fitted values is drawn in grey in 
Fig. 7. Then, for another 30 random assignments, we examined how well the estimates 
fit the actual energy values for these assignments. The standard deviation from these 
other assignments is depicted in black in Fig. 7. From these experiments we conclude 
that the average standard deviation of the estimates from the actual energy values 
obtained by LEneS is close to 8%, sufficiently accurate to be used as an estimator in 
the SA search. 

Next we analyze the behavior of the SA implemented to generate the assignments. 
Depending on the cooling scheme, SA can “freeze” faster or slower, as depicted in 
Fig. 8. The drawback of fast cooling schemes is that the near-optimal solution is worse 
than the one of slow cooling schemes. 

The next set of experiments compares the two low-energy approaches presented in 
the paper, S&S and EonE. We used one hundred graphs with twenty nodes and with a 
dependency depth no larger than four tasks. The reference deadline is the tightest 
possible deadline with the available resources, obtained exactly as in the first experiment. 
Fig. 9 depicts the energy saved by the EonE approach compared to the S&S approach, 
as average, minimal, and maximal values, for various deadline extensions. On average, 
EonE can save around 15% energy in comparison to S&S. In some cases, it can save as 
much as 34% energy. The negative values appear because of the energy estimation 
error, which is around 9% as shown before and which sometimes causes the S&S 
method to behave better. 




Fig. 8. Searching for solutions using SA on the same task-graph. 

Next we present the energy saving capabilities of the two design flows introduced in 
this paper. For this, we chose a real-life application consisting of a sub-system of an 
Unmanned Aerial Vehicle (see [14]). The sub-system we are interested in is an optical 
flow detection (OFD) algorithm, which is part of the traffic monitoring system. In the 
current implementation, the optical flow algorithm consists of 32 tasks, running on 
ADSP-21061L digital signal processors. Limited by other tasks, OFD can process 
78x120 pixels at a rate up to 12.5Hz. Depending on the traffic speed or monitoring 
altitude, such a high processing rate is a waste of resources. In many cases, a much 
lower rate is sufficient. Because of this dynamic behavior, important energy savings 
would be obtained if the design were to use processors supporting multiple voltages. 
Depending on the desired rate, the processors can be run faster or slower, in order to 
reduce the energy consumption. Assuming we can run the DSPs at 3.3V, 2.5V, 1.7V, 
or 0.9V, we applied our two low-energy design flows for the OFD. 

For two different processing rates, lower than 12.5Hz (column 1 in Table 1), we 
assumed different pools of available resources: with two, three, and four processors 
respectively (column 2 in Table 1). For each of these configurations we considered 
three design methods. First, we considered the real situation, when no voltage scaling 
can be performed (the “Single Vdd” column), but the processors are shut down 
whenever they are idling. This energy is always the same, since the processors are 
identical and always execute at the same speed. This is the reference we compare the 
other methods to. Then, we applied our low-energy design methods and compared them 
to the reference energy (the “S&S” and “EonE” columns respectively). As reflected by 
Table 1, there is a trade-off between cost (number of processors) and low energy 
consumption. Note also that even for a processing rate only 50% slower, the energy 
consumption can be almost halved. The simple S&S approach performs fairly well, but 
for very low-energy applications the EonE approach is recommended. 



Table 1: Low-energy design alternatives for the OFD algorithm. 

                                           Energy in % of the Single Vdd Energy 
Processing Rate        # of processors     Single Vdd      S&S        EonE 
6.25Hz (half rate)            2               100          49.47      48.95 
                              3               100          45.16      41.18 
                              4               100          42.07      39.92 
8.33Hz (50% slower)           2               100          71.33      69.64 
                              3               100          60.13      57.04 
                              4               100          55.86      52.77 

8 Conclusions 

In this paper, we presented two system-level design flows for low-energy on architec- 
tures with variable voltage processors. Both design flows are time and resource con- 
strained and include task-to-processor assignment and low-energy scheduling. We use 
simulated annealing to generate assignments and list-scheduling based algorithms for 
scheduling. The first method, Speed-up and Stretch (S&S), indirectly achieves a low 
energy consumption by trading speed for low-energy. The fastest possible assignment 
is found and the schedule is scaled to fit the desired deadline. The second approach, 
Eye-on-Energy (EonE), uses energy related knowledge. The EonE assignment genera- 
tion step uses an energy estimator. The scheduling step uses an energy sensitive prior- 
ity function in a list-scheduling algorithm. The experiments presented here show that 
both approaches can halve the energy consumption when the deadline is relaxed by 
50%. The EonE approach behaves better than S&S by an average of 13%. The experi- 
ments also show the cost vs. low-energy and speed vs. low-energy trade-offs a designer 
has to make. Our methods are especially useful for applications which allow variable 
speed processing rate, as in the real life example presented here. 

Abstract. Because the inductive noise Ldi/dt is induced by power 
changes and can have a disastrous impact on the timing and reliability of 
the system, high-performance CPU designs are more concerned with 
step power reduction than with average power reduction. The step 
power is defined as the power difference between the previous and present 
clock cycles, and represents the Ldi/dt noise at the microarchitecture 
level. Two mechanisms at the microarchitecture level are proposed in 
this paper to reduce the step power of the floating point unit (FPU), as 
the FPU is a potential “hot” spot of Ldi/dt noise. The two mechanisms, 
ramping up and ramping down the FPU based on instruction fetch queue 
(IFQ) scanning and PC+N instruction prediction, can meet any specific 
step power constraint. We implement and evaluate the two mechanisms 
using a performance and power simulator based on the SimpleScalar 
toolset. Experiments using SPEC95 benchmarks show that our method 
reduces the performance loss by a factor of four when compared to a 
recent work. 



1 Introduction 

Because of the growing transistor budget, increasing clock frequency and wider 
datapath width in the modern processor design, there is an ever-growing current 
to charge/discharge the power/ground buses in a short time [1,2]. When the 
current passes through the wire inductance (L) associated with power or ground 
rails, the voltage glitch is induced and is proportional to Ldi/dt, where di/dt is 
the current changing rate. Therefore, the power surge is also known as Ldi/dt 
noise. Further, many power-efficient microarchitecture techniques involve selec- 
tively throttling, or clock gating certain functional units or parts of functional 
units [3,4, 5, 6, 7, 8]. Dynamic throttling techniques may lead to an even larger 
surge current. 

A large surge current may reduce the chip reliability, and cause timing and 
logic errors, i.e., a circuit may switch at the wrong time and latch the wrong 
value. Dealing with a large surge current needs an elaborate power distribution 
and contributes to higher design and manufacturing costs. In this paper, we 
define the step power as the power difference between the previous and present 
clock cycles. We use the step power as a figure of merit for power surges at the 
microarchitecture level, and study how to reduce the step power by ramping up 
and ramping down functional units for high-performance processors. 

We use the floating point unit (FPU) to illustrate our ideas. The FPU consumes 
10-20% of the power of the whole system. Its step power has a significant impact 
on power delivery and signal integrity in processor design. In most previous 
research on dynamic throttling, the FPU is turned on or off immediately, which 
results in a large step power. Recently, Tiwari et al. [9,2] proposed to prolong the 
switch on/off time by inserting “waking up” and “going to sleep” time between 
the on and off state. However, every time an inactive resource is needed, the 
pipeline is forced to stall for several clock cycles before the resource becomes 
available. This may lead to a large performance penalty. 

In this paper, we propose two new mechanisms to ramp up/down (turn 
on/off gradually) the FPU based on either instruction fetch queue (IFQ) 
scanning or PC+N instruction prediction, to meet the step power constraint 
specified by the designer. The main difference between our work and Tiwari’s is 
that we predict in advance whether an instruction will require the resource. This 
enables a request signal to be placed early enough to ramp up the inactive 
FPU gradually and make it ready for use in time. We implement and evaluate 
our two mechanisms using a performance and power simulator based on the 
SimpleScalar toolset. Compared to [9,2], we can reduce the performance loss by 
a factor of 4 on average over SPEC95 benchmarks. 

The paper is organized as follows. Section 2 describes our two step power 
reduction mechanisms in detail. Section 3 discusses a possible implementation 
of the two mechanisms. Section 4 presents the quantitative study of the step 
power and performance impact of the two mechanisms, and Section 5 concludes 
this paper. 



2 The Step Power Reduction Mechanisms 

2.1 Overview 

As the first step towards the power reduction techniques, we have implemented 
an accurate microarchitecture level power estimation tool based on the extended 
SimpleScalar toolset, where the SimpleScalar architecture [10] is divided into 
thirty-two functional blocks, and activity monitors are inserted for these 
functional blocks [11,12]. We have developed a power hook as an interface between the 
extended SimpleScalar toolset and the RTL power models of functional blocks. 
After reading the system configuration and user specified RTL power information 
coming from the real design data or RTL power estimation tool [13,14,15,16,17], 
the activities and the corresponding power information of these functional 
blocks are collected in every clock cycle. Our resulting SimpleScalar toolset is 
able to simulate the performance, average power and step power for every 
functional block and the whole system for given benchmark programs. All our power 
reduction techniques are implemented and tested on this toolset. 

The conventional floating point unit (FPU) design only has two states: the 
inactive state and the active state (see Figure 1(a)). When floating point 
instructions are executed, the FPU is in the active state and consumes the active 
power (Pa). On the other hand, the FPU has no activity in the inactive state and 
dissipates the leakage power (Pi), about 10% of the active power (Pa) in the 
present process technology. When any floating point instruction gets into the 
FPU, the FPU will jump up from the inactive state to the active state in one 
clock cycle and has a step power of (Pa - Pi) (see Figure 1(a)). If we assume that 
the inactive power (leakage power) is 10% of the active power, the step power 
of the FPU will reach 0.9Pa and may translate into a large Ldi/dt noise. 

Essentially, Figure 1 (b) illustrates the technique used in Tiwari’s work [9, 
2]. Stall cycles are inserted to power up the functional units gradually every 
time the inactive resources are needed, which may lead to a big loss of 
performance. In contrast, our work predicts the occurrence of floating point 
instructions and prepares the FPU in advance to reduce this performance penalty. 
In both Tiwari’s and our approaches, the FPU will be powered down gradually 
to save power consumption when it is not used for certain clock cycles. 

We introduce several artificial workload states in our approach. The relationship 
of the inactive state, artificial workload states, and active state of the FPU 
is illustrated in Figure 1 (c). If there are n artificial workload states, the FPU 
consumes power $P_w^i$, $i = 1, 2, \ldots, n$, in the i-th of these states, with 
$P_w^n > P_w^{n-1} > \cdots > P_w^1$. We assume that 
the power difference between adjacent power states is uniform for the simplicity 
of presentation. A special power state, which is only one step below the active 
state, is called the subactive state and dissipates Ps power. After a floating point 
instruction is predicted, the FPU will ramp up and stay in the subactive state. 
The FPU enters the active state when the instruction gets executed. In summary, 
$P_w^0 = P_i$, $P_w^n = P_s$ and $P_w^{n+1} = P_a$. 

2.2 Ramp Up/Down FPU Based on the IFQ Scanning 

The SimpleScalar is a five-stage superscalar architecture. There are two inter- 
mediate stages between the instruction fetch (IF) and execution (EXE) stages. 
We can scan the fetched instructions in the instruction fetch queue (IFQ) of the 
IF stage every clock cycle. We call this mechanism IFQ scanning. If there exist 
floating point instructions, a request signal will be sent to the EXE stage directly 
to ramp up the FPU from the inactive state to the subactive state within pre- 
diction time by adding artificial workload gradually. Here, the prediction time 
is two cycles between the IF and EXE stages. If the floating point instruction really 
gets into the FPU in the EXE stage, the FPU will switch from the subactive state to 
the active state. Otherwise, the FPU will be ramped down to the inactive state after 
the busy time, which is a user defined time to keep the FPU in the subactive state. If 
a floating point instruction appears during the busy time or ramp down time, 
the FPU will ramp up immediately without reaching the inactive state, to reduce 
the performance penalty, as shown in Figure 1(c). 

Fig. 1. The relationship of states. 

A prolonged busy time helps to exploit the spatial and temporal locality of 
operations. If it is set to zero, the FPU will be ramped down immediately after 
it reaches the subactive state. This introduces a larger performance penalty, 
partly because a floating point instruction may be executed out of order and 
may not get into the FPU within the prediction time. 
On the other hand, an infinite busy time keeps the FPU in the subactive state 
and never powers it down. It may increase the performance, but the average 
power dissipation of the FPU will increase a lot, since the FPU always consumes 
the subactive power even in the idle state. 

2.3 Ramp Up/Down FPU Based on Instruction Prediction 

Ramp up/down FPU based on instruction prediction is a more general mechanism 
to reduce the step power. The main idea is to prefetch one more instruction 
per cycle from the I-cache and predict whether the FPU is required by this 
instruction. (IFQ scanning is clearly a simple case of this mechanism.) This will 
help the FPU to ramp up gradually in advance and make it available for use in time. 
Our current implementation is to scan the instruction with address PC + N, 
where PC is the current program counter and N is a user-defined parameter. We 
will ramp up/down the FPU based on the prediction of this instruction as we 
do in the first method. In this way, we can have N + 2 cycles (still including the 
two cycles between IF and EXE stages) to ramp up/down the FPU with a further 
reduced step power. We define this N + 2 as the prediction time, which can also be 
viewed as the ramping time for the FPU to power on/off gradually between the 
inactive and the subactive states shown in Figure 1 (c). 

Further, there is one power step between the subactive state and active state. 
Therefore, there are prediction time + 1 = N + 3 power steps between the inactive 
and active states. We define the step power in this paper as follows: 

$$\text{step power} = \frac{P_a - P_i}{\text{prediction time} + 1} = \frac{P_a - P_i}{N + 3} \qquad (1)$$

For example, if Pi is 0.1Pa and N = 6, we can achieve a step power of 0.1Pa, 
an 88.9% reduction compared to the conventional step power of 0.9Pa. As 
IFQ scanning is a simple case of PC + N instruction prediction with 
N = 0, the FPU in this case has a step power of 0.3Pa. 
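
The arithmetic of Eq. (1) can be checked with a few lines of Python. The function names step_power and power_levels are illustrative only, and powers are expressed as fractions of the active power Pa, so p_active = 1.0 and p_inactive = 0.1 correspond to the 10% leakage assumption used in the text.

```python
def step_power(p_active, p_inactive, n):
    """Step power per Eq. (1).

    The prediction time is N + 2 cycles and one more step separates the
    subactive and active states, giving N + 3 uniform power steps.
    """
    if n < 0:
        raise ValueError("N must be non-negative")
    return (p_active - p_inactive) / (n + 3)


def power_levels(p_active, p_inactive, n):
    """Uniformly spaced power levels, inactive first and active last."""
    step = step_power(p_active, p_inactive, n)
    # N + 3 steps connect N + 4 levels; the second-to-last is subactive.
    return [p_inactive + k * step for k in range(n + 4)]


if __name__ == "__main__":
    for n in (0, 1, 6):
        print(n, round(step_power(1.0, 0.1, n), 3))
```

Under these assumptions the sketch gives 0.3Pa for N = 0 (IFQ scanning), 0.225Pa for N = 1, and 0.1Pa for N = 6, matching the values quoted in the text.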

Because the SimpleScalar is an out-of-order execution superscalar processor, 
the predicted instructions may stall in the dispatch stage due to data or resource 
hazards. This will cause the predicted instructions to be executed at a different 
time. Extra stall cycles have to be inserted until the FPU reaches the subactive 
state and becomes available again, which will introduce a performance penalty. On 
the other hand, when PC+N instruction prediction is used, there may be branch 
instructions among these N instructions and branch misprediction may happen. 
We currently assume that all the branch instructions are not taken, so we 
do not need any extra circuitry and the cost is kept to a minimum. Nevertheless, 
we can utilize other existing branch prediction techniques [18,19] to reduce 
misprediction and achieve better results, but with a higher hardware overhead. 
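
The not-taken assumption can be illustrated with a small Python sketch. Everything in it is hypothetical: the opcode names in FP_OPCODES, the representation of a program as a list of opcode strings, and the use of instruction indices in place of byte addresses are simplifications introduced here, not part of the simulator.

```python
# Hypothetical opcode names standing in for floating point instructions.
FP_OPCODES = {"fadd", "fsub", "fmul", "fdiv"}


def predict_fpu_needed(program, pc, n):
    """Predecode the instruction at index pc + n.

    Assumes that no instruction between pc and pc + n is a taken branch,
    so the lookahead simply follows the sequential instruction stream.
    """
    target = pc + n
    return target < len(program) and program[target] in FP_OPCODES


def lookahead_trace(program, n):
    """Prediction made at every pc when control flow is purely sequential."""
    return [predict_fpu_needed(program, pc, n) for pc in range(len(program))]


if __name__ == "__main__":
    program = ["add", "load", "fadd", "add", "fmul"]
    print(lookahead_trace(program, 2))
```

If a taken branch lies between pc and pc + n, the instruction inspected here is not the one that will execute, which is the source of the mispredictions discussed above.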

We present the quantitative results on performance and step power, using the 
SPEC95 FP benchmark programs, in Section 4. 



3 The Implementation Method 

We summarize the ramping up/down algorithm based on the PC+N prediction 
in Figure 2. To ramp up/down the FPU based on instruction prediction, one 
more instruction will be fetched and predecoded in the IF stage every clock 
cycle. In this case, a small predecode and control logic is needed. Two counters, 
count_fpu_busytime and count_fpu_ramptime, are used to count up/down the busy 
time and prediction time respectively. A logic signal signal_fpu_ramp will be used 
to indicate the state of ramping up or ramping down. 

As shown in Figure 2, when a floating point instruction is detected in the 
instruction fetch stage, signal_fpu_ramp will be set to ramp_up and sent to the 
scheduler stage. The counter count_fpu_busytime will be reset to 0. If there is 
no floating point instruction executed by the FPU, count_fpu_busytime will start 
counting. When it reaches busy time, signal_fpu_ramp is changed to ramp_down 
and count_fpu_busytime will be reset again. 

The scheduler stage will keep checking the status of signal_fpu_ramp. If it is 
set to ramp_up, count_fpu_ramptime will be used to ramp up the FPU from the 
inactive state to the subactive state within prediction time by increasing the workload 
of the FPU gradually. The FPU is ready for execution in the subactive state. On 
the other hand, if signal_fpu_ramp is set to ramp_down, count_fpu_ramptime will 
be decremented to ramp down the FPU gradually, and the FPU is not available then. 

The Microarchitectural Level Instruction Prediction Algorithm: 

    /* In the INSTRUCTION FETCH stage */ 
    Prefetch instruction pwr_instr that is N cycles later; 
    Predecode this instruction pwr_instr; 
    if (pwr_instr is FP instruction) { 
        signal_fpu_ramp = ramp_up;        // start to ramp up FPU 
        reset count_fpu_busytime; 
    } else if (FPU is in the subactive state) { 
        if (count_fpu_busytime == busytime) { 
            signal_fpu_ramp = ramp_down;  // start to ramp down FPU 
            reset count_fpu_busytime; 
        } else 
            count_fpu_busytime++; 
    } 

    /* In the SCHEDULER stage */ 
    if (signal_fpu_ramp == ramp_up) { 
        if (FPU reaches the subactive state) 
            FPU_is_available = 1; 
        else { 
            FPU_is_available = 0; 
            count_fpu_ramptime++; 
        } 
    } else if (signal_fpu_ramp == ramp_down) { 
        if (count_fpu_ramptime > 0) { 
            FPU_is_available = 0; 
            count_fpu_ramptime--; 
        } 
    } 

    /* In the EXECUTION stage */ 
    if (the instruction is floating point instruction) { 
        if (FPU_is_available) 
            issue FP instruction to FPU; 
        else { 
            stall FP instruction; 
            start to ramp up FPU immediately; 
        } 
    } 

Fig. 2. Microarchitecture Level Instruction Prediction Algorithm 

In the execution stage of SimpleScalar, a floating point instruction is issued to 
the FPU only when the FPU is available, which is decided in the scheduler stage. 
Otherwise, the floating point instruction has to stall and wait for the FPU to 
ramp up. If the FPU is in the inactive or ramping down state and a new floating 
point instruction appears, the FPU starts to ramp up immediately to reduce the 
performance penalty. 

In terms of circuit implementation, both the clock network and FPU can be 
partitioned into subcircuits. One subcircuit will be enabled or disabled per cycle 
during ramping up or ramping down via clock gating as discussed in [20,21]. 
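
The control flow of Figure 2 can be sketched as a small cycle-level model in Python. This is an illustration under stated assumptions rather than the simulator code: the class name FpuRampController, its method names, and the parameters ramp_steps and busy_time are hypothetical; the attribute level plays the role of count_fpu_ramptime and busy that of count_fpu_busytime. The busy counter is compared with >= instead of an exact match, which is an assumption made here for robustness.

```python
RAMP_UP, RAMP_DOWN = "ramp_up", "ramp_down"


class FpuRampController:
    """Cycle-level sketch of the three stages shown in Figure 2."""

    def __init__(self, ramp_steps, busy_time):
        self.ramp_steps = ramp_steps  # cycles from inactive to subactive
        self.busy_time = busy_time    # idle cycles tolerated in subactive
        self.level = 0                # 0 = inactive, ramp_steps = subactive
        self.busy = 0                 # idle cycles counted so far
        self.signal = RAMP_DOWN       # models signal_fpu_ramp
        self.available = False        # models FPU_is_available

    def fetch(self, predicted_fp):
        """INSTRUCTION FETCH stage: react to the predecoded instruction."""
        if predicted_fp:
            self.signal = RAMP_UP
            self.busy = 0
        elif self.level == self.ramp_steps:
            if self.busy >= self.busy_time:
                self.signal = RAMP_DOWN
                self.busy = 0
            else:
                self.busy += 1

    def schedule(self):
        """SCHEDULER stage: move one power step per cycle."""
        if self.signal == RAMP_UP:
            if self.level == self.ramp_steps:
                self.available = True
            else:
                self.available = False
                self.level += 1
        elif self.level > 0:
            self.available = False
            self.level -= 1

    def try_issue_fp(self):
        """EXECUTION stage: True if the FP instruction issues, else stall."""
        if self.available:
            return True
        self.signal = RAMP_UP  # start ramping up immediately
        return False
```

Calling fetch and schedule once per simulated cycle and try_issue_fp whenever a floating point instruction reaches the execution stage gives a count of stall cycles for a given busy time and ramping time.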

4 The Experiment Results 

In this section, SPEC95 FP benchmark programs are used to study the 
performance impacts of the two FPU step power reduction techniques. The 
performance is measured in instructions per cycle (IPC). We use the performance 
without any ramping up and down as the base to measure the performance loss, 
and summarize the system configuration used in our experiment in Table 1. 



4.1 Impact of Busy Time 

Figure 3 shows the performance loss in the IFQ scanning mechanism. In this 
figure, the constant prediction time is two, the constant step power is 0.3Pa, but 
the busy time varies from five to fifteen. As expected, the IPC loss is reduced 
when the busy time is increased, because the FPU has more time to stay in the 
subactive state and better prepares for the execution of floating point instruc- 
tions. The IPC loss is less than 2.0% for the FP programs, when busy time is 
ten clock cycles. 

By using the PC + N instruction prediction mechanism with a prediction 
time of eight, we can achieve a step power of 0.1Pa (an 88.9% reduction compared 
to the conventional design with only active and inactive power states). The busy 
time varies from five to fifteen clock cycles in Figure 4. Again, the performance 
penalty is smaller when the busy time increases. The performance loss is less 
than 3.0% when the busy time is larger than ten clock cycles. However, the 
longer the FPU stays in the subactive state, the more power it consumes, as the 
subactive power is much larger than the inactive power. Therefore, there exists 
a tradeoff between performance penalty and average power reduction. It can be 
controlled by the user defined busy time. Because the performance loss levels 
off and is less than 3.0% when the busy time is larger than ten cycles, we 
choose ten clock cycles as the busy time in the remainder of this section. 



Table 1. System configuration for experiments. 

Functional Unit              number    Latency 
Integer-ALU                    4         1 
Integer-MULT/DIV               1         3/20 
Floating Point Adder           4         2 
Floating Point MULT/DIV        1         4/12 
Fetch/decode/issue width       4 
Memory bus width               8 

Cache     nsets    bsize    assoc    Repl 
I-L1        512       32        1    LRU 
D-L1        128       32        4    LRU 
I-TLB        16     4096        4    LRU 
D-TLB        32     4096        4    LRU 
U-L2       1024       64        4    LRU 

Fig. 3. Performance penalty versus busy time in IFQ scanning with a prediction time 
of two. 

Fig. 4. Performance penalty versus busy time for PC + N instruction prediction. The 
prediction time is eight. 



4.2 Impact of Prediction Time 

Figure 5 reflects the relationship between performance loss and prediction time. 
The step power is 0.225Pa by setting N = 1 and prediction time = 3. This 
prediction leads to less than 1.2% performance penalty for all benchmarks. 
Generally, when the required step power becomes smaller, a bigger N is needed as 
the FPU needs more time to ramp up/down. The potential chance of misprediction 
will increase, and more performance loss may be induced. For example, 
if the required step power is 0.1Pa, as often required in real designs, we can 
set N = 6 and prediction time = 8. The performance penalty is increased but is 
still relatively small. The performance loss is less than 3.0% for all benchmarks. 
Clearly, there exists a tradeoff between the performance penalty and step power 
reduction. The tradeoff can be adjusted by the prediction time. As shown in 
Figure 5, benchmarks turb3d and swim are sensitive to the prediction time. This may 
be due to our branch prediction scheme, and is worth further investigation. 

Fig. 5. Performance penalty versus prediction time for PC + N instruction prediction. 
The busy time is ten cycles. Prediction time = 2 stands for IFQ scanning. 

An alternative way to achieve a step power of 0.1Pa without using ramping up 
(and therefore without performance loss) is to set the inactive power Pi = 0.9Pa. 
Dummy operations are used to keep this level of inactive power when no instruction 
really needs the FPU. This leads to a huge amount of unnecessary power dissipation. 
With our prediction technique, the FPU consumes only the leakage power in 
the inactive state. As the leakage power is about 0.1Pa for the current process, 
we can achieve a factor of nine in terms of power saving in the inactive state. 
Note that even the leakage power can be saved if power gating is used to cut 
off the power supply. 

4.3 Comparison with Previous Work 

In the following, we compare our performance loss with that in Tiwari’s work 
[9,2], where the pipeline is forced to stall in order to ramp up the inactive 
FPU. In Figures 6 and 7, light-colored bars are performance losses without using 
prediction, based on our own implementation of the approach in [9,2], and 
dark-colored bars are performance losses using our IFQ scanning prediction technique 
in Figure 6 and N-cycle instruction prediction technique in Figure 7. The busy 
time is ten for both figures, and the prediction time (same as ramping time) is 
two and eight in Figures 6 and 7, respectively. One can see that our prediction 
techniques achieve much better performance. When the prediction time is two, 
the average performance loss is 1.96% without prediction, but is only 0.90% 
with prediction. When the prediction time is eight, the average performance loss 
is 4.65% without prediction, but is only 1.08% with prediction. The prediction 
reduced the performance loss by a factor of more than 4 in both cases. 

Fig. 6. Performance Loss Comparison between IFQ Scanning prediction and 
non-prediction for SPEC95 FP programs (ramping time is two cycles). 

Fig. 7. Performance Loss Comparison between N-cycle instruction prediction and 
non-prediction for SPEC95 FP programs (ramping time is eight cycles). 

5 Conclusions and Discussions 

Based on an extended SimpleScalar toolset and the SPEC95 benchmark set, a 
preliminary study has been presented at the microarchitecture level to reduce the 
inductive noise by ramping up/down the floating point unit. Instruction fetch 
queue (IFQ) scanning and PC + N instruction prediction have been proposed 
to reduce the performance loss due to ramping up and ramping down. Our 
techniques are able to satisfy any given constraint on the step power, and reduce 
the performance loss by a factor of more than 4 when compared with a recent 
work without prediction. 

We assume that the ramping time is the same as the prediction time in this paper. 
In general, the two times may differ, which could further reduce the performance loss. 
It is part of our ongoing work to find the optimal prediction time for the ramping 
time determined by a given step power constraint. Further, we are investigating 
the power impact of ramping up individual adders and multipliers. We also plan 
to apply the prediction mechanism to other functional units such as the integer 
ALU and Level-2 cache. Recent progress can be found at our group webpage 
http://eda.ece.wisc.edu. 

Abstract. Increasing power dissipation has become a major constraint 
for future performance gains in the design of microprocessors. In this 
paper, we present the circuit design of an issue queue for a superscalar 
processor that leverages transmission gate insertion to provide dynamic 
low-cost configurability of size and speed. A novel circuit structure dy- 
namically gathers statistics of issue queue activity over intervals of in- 
struction execution. These statistics are then used to change the size of 
an issue queue organization on-the-fly to improve issue queue energy and 
performance. When applied to a fixed, full-size issue queue structure, the 
result is up to a 70% reduction in energy dissipation. The complexity of 
the additional circuitry to achieve this result is almost negligible. Fur- 
thermore, self-timed techniques embedded in the adaptive scheme can 
provide a 56% decrease in cycle time of the CAM array read of the is- 
sue queue when we change the adaptive issue queue size from 32 entries 
(largest possible) to 8 entries (smallest possible in our design). 



1 Introduction 

The out-of-order issue queue structure is a major contributor to the overall 
power consumption in a modern superscalar processor, like the Alpha 21264 and 
MIPS R10000 [1,2]. It also requires the use of complex control logic in 
determining and selecting the ready instructions. Such complexity, besides adding to 
the overall power consumption, also complicates the verification task. Recent 
work by Gonzalez et al. [3,4] has addressed these problems by proposing design 
schemes that reduce either the control logic complexity [3] or the power [4] 
without significantly impacting the IPC performance. In [3], the authors propose and 
evaluate two different schemes. In the first approach, the complexity of the issue 
logic is reduced by having a separate ready queue which only holds instructions 
with operands that are determined to be fully available at decode time. Thus, 
instructions can be issued in-order from this ready queue at reduced complex- 
ity, without any associative lookup. A separate first-use table is used to hold 
instructions, indexed by unavailable operand register specifiers. Only those in- 
structions that are the first-time consumers of these pending operands are stored 
in this table. Instructions which are deeper in the dependence chain simply stall 
or are handled separately through a separate issue queue. The dependence link 
information connecting multiple instances of the same instruction in the first-use 
table is updated after each instruction execution is completed. At the same time, 
if a given instruction is deemed to be ready it is moved to the in-order ready 
queue. Since none of the new structures require associative lookups or run-time 
dependence analysis, and yet, instructions are able to migrate to the ready queue 
as soon as their operands become available, this scheme significantly reduces the 
complexity of the issue logic. 

The second approach relies on static scheduling; here, the main issue queue 
only holds instructions with pre-determined availability times of their source 
operands. Since the queue entries are time-ordered (due to known availabilities), 
the issue logic can use simple, in-order semantics. Instructions with operands 
which have unknown availability times are held in a separate wait queue and get 
moved to the main issue queue only when those times become definite. In both 
approaches described in [3], the emphasis is on reduction of the complexity of 
the issue control logic. The added (or augmented) support structures in these 
schemes may actually cause an increase of power, in spite of the simplicity and 
elegance of the control logic. In [4], the main focus is on power reduction. The 
issue queue is designed to be a circular queue structure, with head and tail 
pointers, and the effective size is dynamically adapted to fit the ILP content 
of the workload during different periods of execution. The work in [4] leverages 
previous work [5,6] in dynamically sizing the issue queue. In both [3] and [4], the 
authors show that the IPC loss is very small with the suggested modifications 
to the issue queue structure and logic. Also, in [4], the authors use a trace- 
driven power-performance simulator (based on the model by Cai [7]) to report 
substantial power savings on dynamic queue sizing. However, a detailed circuit- 
level design and simulation of the proposed implementations are not reported in 
[3] or [4]. Without such analysis, it is difficult to gauge the cycle-time impact or 
the extra power/complexity of the augmented design. 

In our work, we propose a new adaptive issue queue organization and we 
evaluate the power savings and the logic overhead through actual circuit-level 
implementations and their simulation. This work was done as a part of a research 
project targeted to explore power-saving opportunities in future, high-end pro- 
cessor development within IBM. Our scheme is simpler than that reported in [3, 
4] in that it does not introduce any new data storage or access structure (like 
the first-use table or the wait queue in [3]). Rather, it proposes to use an existing 
framework, like the CAM/RAM structure commonly used in the design of issue 
queues [8]. However, the effective size of the issue queue is dynamically adapted 
to fit the workload demands. This aspect of the design is conceptually similar 
to the method proposed in [4] but our control logic is quite different. 

2 Power and Performance Characteristics of a 
Conventional (Non-adaptive) Issue Queue 

The purpose of the issue queue is to receive instructions from the dispatch stage 
and forward ready instructions to the execution units. An instruction is ready 
to issue when the data needed by its source operands and the functional unit 
are available or will be available by the time the instruction is ready to read the 
operands, prior to execution. 

Many superscalar microprocessors, such as the Alpha 21264 [1] and MIPS 
R10000 [2], use a distributed issue queue structure, which may include separate 
queues for integer and floating point operations. For instance in the Alpha 21264 
[9], the issue queue is implemented as flip-flop latch-based FIFO queues with a 
compaction strategy, i.e., every cycle, the instructions in the queue are shifted to 
fill up any holes created due to prior-cycle issues. This makes efficient use of the 
queue resource, while also simplifying the wake-up and selection control logic. 
However, compaction entails shifting instructions around in the queue every 
cycle and depending on the instruction word width may therefore be a source of 
considerable power consumption. Studies have shown that overall performance is 
largely independent of what selection policy is used (oldest first, position based, 
etc.) [10]. As such, the compaction strategy may not be best suited for low power 
operation; nor is it critical to achieving good performance. So, in this research 
project, an initial decision was made to avoid compaction. Even if this means 
that the select arbitration must be performed over a window size of the entire 
queue, this is still a small price to pay compared to shifting multiple queue 
entries each cycle. 

Due to the above considerations, a decision was made to use a RAM/CAM 
based solution [8]. Intuitively, a RAM/CAM would be inherently lower power due 
to its smaller area and because it naturally supports a non-compaction strategy. 
The RAM/CAM structure forms the core of our issue queue design. The op-code, 
destination register specifier, and other instruction fields (such as the instruction 
tag) are stored in the RAM. The source tags are stored in the CAM and are 
compared to the result tags from the execution stage every cycle. Once all source 
operands are available, the instruction is ready to issue provided its functional 
unit is available. The tag comparisons performed by the CAM and the checks to 
verify that all operands are available constitute the wakeup part of the issue unit 
operation. While potentially consuming less power than a flip-flop based solution, 
the decision of using a RAM/CAM structure for the issue queue is not without 
its drawbacks. CAM and RAM structures are in fact inherently power hungry 
as they need to precharge and discharge internal high capacitance lines and 
nodes for every operation. The CAM needs to perform tag matching operations 
every cycle. This involves driving and clearing high capacitance tag-lines, and 
also precharging and discharging high capacitance matchline nodes every cycle. 
Similarly, the RAM also needs to charge and discharge its bitlines for every read 
operation. Our research on low-power issue queue designs was focused on two 
aspects: (a) Innovating new circuit structures, which reduce power consumption 
in the basic CAM/RAM structure; and (b) Dynamic adaptation of the effective 
CAM/RAM structure by exploiting workload variability. This paper describes 
the work done on the second aspect. However, dynamic queue sizing can degrade 
CPI performance as well. Part of the design challenge faced in this work was to 
ensure that the overall design choices do not impact performance significantly, 
while ensuring a substantial power reduction. 
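
The wakeup step described in this section can be illustrated with a small behavioral model. This is only a sketch and not the circuit itself: the names Entry, src_tags and wakeup are hypothetical, and the functional-unit availability check is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One issue queue entry; only the CAM side (source tags) is modeled."""
    src_tags: tuple                              # source tags held in the CAM
    ready: list = field(default_factory=list)    # one flag per source operand

    def __post_init__(self):
        if not self.ready:
            self.ready = [False] * len(self.src_tags)


def wakeup(queue, result_tags):
    """Compare every stored source tag against this cycle's result tags.

    Returns one flag per entry: True once all of its operands are available.
    """
    for entry in queue:
        for i, tag in enumerate(entry.src_tags):
            if tag in result_tags:
                entry.ready[i] = True
    return [all(entry.ready) for entry in queue]
```

Every entry is compared against every result tag in each call, which mirrors why the tag-lines and matchlines are exercised every cycle regardless of how many entries hold useful instructions.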

Non-adaptive designs (like the R10000 and Alpha 21264) use fixed-size resources 
and a fixed functionality across all program runs. The choices are made 
to achieve best overall performance over a range of applications. However, an in- 
dividual application whose requirements are not well matched to this particular 
hardware organization may exhibit poor performance. Even a single application 
run may exhibit enough variability that causes uneven use of the chip resources 
during different phases. Adaptive design ideas (e.g., [5]) exploit the workload 
variability to dynamically adapt the machine resources to match the program 
characteristics. As shown in [5], such ideas can be used to increase overall perfor- 
mance by exploiting reduced access latencies in dynamically resized resources. 

Non-adaptive designs are inherently power-inefficient as well. A fixed queue 
will waste power unnecessarily in the entries that are not in use. Figure 1 shows 
utilization data for one of the queue resources within a high performance processor 
core when simulating the SPECint95 benchmarks. From this figure, we 
see that the upper 9 entries contribute to 80% of the valid entry count. Dynamic 
queue sizing clearly has the potential of achieving significant power reduction, 
as other research has demonstrated as well [4,6]. One option to save power is to 
clock-gate each issue queue entry on a cycle-by-cycle basis. However, clock gating 
alone does not address some of the largest components of the issue queue power, 
such as the CAM taglines, the RAM/CAM precharge logic, and the RAM/CAM 
bitlines. So a scheme which allows shutting down the queue in chunks based on 
usage reductions to address these other power components can produce significant 
additional power savings over clock gating. This idea forms the basis of the 
design described in this paper. 

Fig. 1. Histogram of valid entries for an integer queue averaged over SPECint95 

3 Adaptive Issue Queue Design 

In this section, we discuss the adaptive issue queue design in detail. First, we 
describe the high-level structure of the queue. Then, we present partitioning 
of the CAM/RAM array and the self-timed sense amplifier design. Finally, we 
discuss the shutdown logic that is employed to configure the adaptive issue queue 
at run-time. 

3.1 High-Level Structure of Adaptive Issue Queue 

Our approach to issue queue power savings is to dynamically shut down and re- 
enable entire blocks of the queue. Shutting down blocks rather than individual 
entries achieves a more coarse-grained precharge gating. A high-level mechanism 
monitors the activity of the issue queue over a period of execution called the cycle 
window and gathers statistics using hardware counters (discussed in section 3.3). 
At the end of the cycle window, the decision logic enables the appropriate control 
signals to disable and enable queue blocks. A very simple mechanism for the 
decision logic in pseudocode is listed below. 

    if (present_IPC < factor * last_IPC) 
        increase_size; 
    else if (counter < threshold_1) 
        decrease_size; 
    else if (counter < threshold_2) 
        retain_current_size; 
    else 
        increase_size; 

At the end of the cycle window, there are four possible actions. The issue queue 
is ramped up to the next larger size if the present IPC has fallen below factor 
times the IPC measured during the previous cycle window. This guarding 
mechanism attempts to limit the performance loss of adaptation. Otherwise, 
depending on the comparison of counter values with certain threshold values, 
the decision logic may do the following: i) increase issue queue size by enabling 
higher order entries, ii) retain the current size, or iii) decrease the size by disabling 
the highest order entries. Note that a simple NOR of all the active instructions 
in a chunk ensures that all entries are issued before the chunk is disabled. 
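
The decision logic above can be expressed as a small runnable function. The following is a minimal sketch under stated assumptions: the four sizes follow the 32-entry, 8-entry-chunk organization used in this paper, while the function name next_size, the argument names and the saturation at the smallest and largest sizes are illustrative choices, not part of the hardware description.

```python
# Queue sizes selectable at run-time (8-entry chunks of a 32-entry queue).
SIZES = (8, 16, 24, 32)


def next_size(size, present_ipc, last_ipc, counter, factor,
              threshold_1, threshold_2):
    """Return the queue size to use during the next cycle window."""
    idx = SIZES.index(size)
    up = SIZES[min(idx + 1, len(SIZES) - 1)]   # saturate at 32 entries
    down = SIZES[max(idx - 1, 0)]              # saturate at 8 entries
    if present_ipc < factor * last_ipc:
        return up      # IPC guard: performance dropped too far
    if counter < threshold_1:
        return down    # low activity: disable the highest-order chunk
    if counter < threshold_2:
        return size    # moderate activity: keep the current size
    return up          # high activity: enable the next chunk
```

The IPC guard is tested first, so a performance drop forces a size increase even when the activity counter alone would suggest shrinking the queue.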

3.2 Partitioning of the RAM/CAM Array and Self-Timed Sense 
Amplifiers 

The proposed adaptive CAM/RAM structure is illustrated in Figure 2. The effective 
sizes of the individual arrays can be changed at run-time by adjusting 
the enable inputs that control the transmission gates. For our circuit-level implementation 
and simulation study, a 32-entry issue queue is assumed which 
is partitioned into four 8-entry chunks. For the taglines, a separate scheme is 
employed in order to avoid a cycle time impact. A global tag-line is traversed 
through the CAM array and its local tag-lines are enabled/disabled depending 
on the control inputs. The sense amplifiers and precharge logic are located at 
the bottom of both arrays. Another feature of the design is that these CAM and 
RAM structures are implemented as self-timed blocks. The timing of the structure 
is performed via an extra dummy bitline within the datapath of CAM/RAM 
structures, which has the same layout as the real bitlines. A logic zero is stored 
in every dummy cell. A reading operation of the selected cell creates a logical 
one to zero transition on the dummy bitline that controls the set input of the 
sense amplifier. (Note that the dummy bitline is precharged each cycle as with 
the other bitlines.) This work assumes a latching sense amplifier that is able 
to operate with inputs near Vdd. When the set input is high, a small voltage 
difference from the memory cell passes through the NMOS pass gates of the 
sense amplifier. When the set signal goes low, the cross-coupled devices amplify 
this difference to a full rail signal as the pass gates turn off to isolate the cross-coupled 
structure from the bitline load. When the issue queue size is 8, a faster 
access time is achieved because of the 24 disabled entries. The self-timed sense 
amplifier structure takes advantage of this feature by employing the dummy bitline 
to allow faster operation, i.e., the dummy bitline enables the sense amplifiers 
at the exact time the data becomes available. Simulations show that one may 
achieve up to a 56% decrease in the cycle time of the CAM array read by this 
method. Therefore, downsizing to a smaller number of entries results in a faster 
issue queue cycle time and saves energy, similar to prior work related to adaptive 
cache designs [11,12,13]. However, in this paper we do not explore options for 
exploiting the variable cycle time nature of the design, but focus only on its 
power-saving features. 

Fig. 2. Adaptive CAM/RAM structure 



3.3 Shutdown Logic 

A primary goal in designing the shutdown logic is not to add too much overhead 
to the conventional design in terms of transistor count and energy dissipation. 
Table 1 shows the complexity of the shutdown logic in terms of transistor count. 



Table 1. Complexity of shutdown logic in terms of transistor count 

Issue Queue         Transistor Counts   Transistor Counts   Complexity of 
Number of Entries   Issue Queue         Shutdown Logic      Shutdown Logic 
16                  28820               802                 2.8% 
32                  57108               1054                1.8% 
64                  113716              1736                1.5% 
128                 227092              2530                1.1% 



From this table it is clear that the extra logic adds only a small amount of 
complexity to the overall issue queue. AS/X [14] simulations show that this extra 
circuitry dissipates 3% of the energy dissipated by the whole CAM/RAM 
structure on average. Figure 3 illustrates the high-level operation of the shutdown 
logic. It consists of bias logic at the first stage followed by the statistics 
process & storage stage. The activity information is first filtered by the bias logic 
and then passed to the process & storage stage, where it is accumulated in 
counters. At the end of the cycle window, this data passes through the decision 
logic to generate the corresponding control inputs. 
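
As a quick check of how the complexity column of Table 1 relates to the two transistor counts, the ratio can be recomputed directly. The short sketch below assumes that the percentage is the shutdown logic count divided by the issue queue count; the names TABLE_1 and overhead_percent are illustrative.

```python
# Transistor counts copied from Table 1:
# entries -> (issue queue transistors, shutdown logic transistors)
TABLE_1 = {
    16: (28820, 802),
    32: (57108, 1054),
    64: (113716, 1736),
    128: (227092, 2530),
}


def overhead_percent(entries):
    """Shutdown logic transistors as a percentage of issue queue transistors."""
    queue, shutdown = TABLE_1[entries]
    return 100.0 * shutdown / queue


for n in sorted(TABLE_1):
    print(f"{n:3d} entries: {overhead_percent(n):.2f}%")
```

Under that assumption the recomputed ratios round to the tabulated values, and the relative overhead shrinks as the queue grows because the shutdown logic grows more slowly than the array itself.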

The 32-entry issue queue is partitioned into 8-entry chunks that are sepa- 
rately monitored for activity. The bias logic block monitors the activity of the 
issue queue in 4-entry chunks. This scheme is employed to decrease the fan-in 
of the bias logic. The bias logic simply gathers the activity information over 
four entries and averages them over each cycle. The activity state of each in- 
struction may be inferred from the ready flag of that particular queue entry. 
One particular state of interest is when exactly half of the entries in the mon- 
itored chunk are active. One alternative is to statically choose either active or 
not active in this particular case. Another approach is to dynamically change 
this choice by making use of an extra logic signal variable. (See Adaptive Bias 
Logic in Figure 3.) 
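
The behavior of the bias logic for one 4-entry group can be sketched as a majority vote with a configurable tie-break. This is a behavioral illustration only; the function name bias and the parameter tie_is_active are hypothetical, with tie_is_active standing in for the extra logic signal mentioned above.

```python
def bias(ready_flags, tie_is_active=True):
    """Classify a 4-entry group as active (True) or not active (False).

    ready_flags: four booleans, one per queue entry in the group.
    tie_is_active: outcome when exactly two of the four entries are active.
    A fixed value models the static choice; changing it between calls
    models the adaptive choice.
    """
    if len(ready_flags) != 4:
        raise ValueError("the bias logic monitors 4-entry groups")
    active = sum(bool(flag) for flag in ready_flags)
    if active == 2:
        return tie_is_active
    return active > 2
```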

Fig. 3. High-level structure of shutdown logic and logic table for bias logic 

[Figure panel: Statistics Process & Storage. Legend: NA: Not Active, A: Active] 

Fig. 4. Statistics process and storage stage for shutdown logic 



The statistics process & storage stage, which is shown in Figure 4, consists 
of two different parts. The detection logic provides the value that will be added 
to the final counter. It gathers the number of active chunks from the bias logic 
outputs and then generates a certain value (e.g., if there are two active 8-entry 
chunks, the detection logic will generate binary two to add to the final counter). 
The second part, which is the most power hungry, is the flip-flop and adder pair 
(forming the counter). Each cycle, this counter is incremented by the number of 
active clusters (8-entry chunks). In this figure one can also see the function of 
the detection logic. The zeros in the inputs correspond to the non-active clusters 
and the ones to active clusters. The result section shows which value in binary 
should be added. For 32 entries, two of these detection circuits and a small three-bit 
adder are required to produce the counter input. One of the detection logic 
units covers the upper 16 entries and the other one covers the bottom 16 entries. 
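
The counter update described above can be modeled in a few lines. This is a sketch under assumptions: the per-chunk active flags are taken as given inputs (how two 4-entry bias outputs combine into one 8-entry chunk flag is not modeled here), and the names detect and ActivityCounter are illustrative.

```python
def detect(chunk_pair):
    """Detection logic for 16 entries: number of active 8-entry chunks (0..2)."""
    return sum(1 for active in chunk_pair if active)


class ActivityCounter:
    """Flip-flop and adder pair that accumulates activity over a cycle window."""

    def __init__(self):
        self.value = 0

    def clock(self, chunk_active):
        """Advance one cycle; chunk_active holds four flags, lowest chunk first."""
        lower = detect(chunk_active[:2])   # bottom 16 entries
        upper = detect(chunk_active[2:])   # upper 16 entries
        self.value += lower + upper        # small adder combines both halves
        return self.value

    def reset(self):
        """Clear the counter at the start of a new cycle window."""
        self.value = 0
```

At the end of a cycle window the accumulated value would be compared against the thresholds of the decision logic in Section 3.1 and then reset.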



4 Simulation Based Results 

In this section, we first present circuit-level data and simulation results. Later, we 
discuss microarchitecture-level simulation results that demonstrate the workload 
variability in terms of issue queue usage. 



4.1 Circuit-Level Data 

Figure 5 shows the energy savings (from AS/X simulations) achieved with an 
adaptive RAM array. (Note that in this figure only positive energy savings 
numbers are presented.) There are several possible energy/performance tradeoff 
points depending on the transistor width of the transmission gates. A larger tran- 
sistor width results in less cycle time impact, although more energy is dissipated. 
The cycle time impact of the additional circuitry did not affect the overall target 
frequency of the processor across all cases. (This was true also for the CAM 
structure.) By going down to 0.39 µm transistor width, one can obtain an energy 
savings of up to 44%. These numbers are inferred from the energy dissipation 
corresponding to one read operation of a 32-entry conventional RAM array and 
that of various alternatives of the adaptive RAM array. (The size of the queue 
is varied over the value points: 8, 16, 24 and 32.) An interesting feature of the 
adaptive design is that it achieves energy savings even with 32 entries enabled. 
This is because the transmission gates in the adaptive design reduce the signal 
swing, thereby resulting in less energy dissipation. 

The adaptive CAM array energy and delay values are presented in Figure 6 
and Figure 7, respectively, for various numbers of enabled entries and transmission 
gate transistor widths. These values account for the additional circuitry 
that generates the final request signal for each entry (input to the arbiter logic). 
With this structure, a 75% savings in energy dissipation is achieved by downsizing 
from 32 entries to 8 entries. Furthermore, the cycle time of the CAM 
array read is reduced by 56%. It should be noted that a 32-entry conventional 
CAM structure consumes roughly the same amount of energy as the adaptive 
CAM array with 32 entries. Because the CAM array dissipates ten times more 
energy than the RAM array (using 2.34 µm transmission gate transistor width), a 
75% energy savings in the CAM array corresponds to a 70% overall issue queue 
energy savings (shutdown logic overhead is included). 
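
The step from a 75% CAM savings to roughly 70% overall can be sanity-checked with an energy-weighted average. The sketch below is an assumption-laden illustration: the RAM savings at the 2.34 µm width is not stated in the text, so it is left as a parameter, the shutdown logic overhead is not modeled, and the name overall_savings is hypothetical.

```python
def overall_savings(cam_savings, ram_savings, cam_to_ram_ratio=10.0):
    """Energy-weighted savings of the combined CAM + RAM structure.

    cam_to_ram_ratio: CAM energy divided by RAM energy (ten in the text).
    ram_savings: fractional RAM savings; an assumed input, not a reported value.
    """
    total = cam_to_ram_ratio + 1.0
    return (cam_to_ram_ratio * cam_savings + ram_savings) / total


# Bracket the unknown RAM contribution between no savings and the best
# RAM savings reported for any transistor width (44%).
low = overall_savings(0.75, 0.0)
high = overall_savings(0.75, 0.44)
print(f"overall savings between {low:.1%} and {high:.1%}")
```

Because the CAM dominates the total energy, the overall figure stays close to the CAM savings whatever the RAM contributes, and the reported 70% falls inside this bracket.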

Fig. 5. Adaptive RAM array energy savings 

Fig. 6. Adaptive CAM array energy values 



4.2 Microarchitecture-Level Simulation and Results 

The work reported thus far in this paper demonstrates the potential power sav- 
ings via dynamic adaptation of the issue queue size. In other words, we have 
designed a specific, circuit-level solution that allows the possibility of such adap- 
tation; and, we have quantified, through simulation, the energy savings potential 
when the queue is sized downwards. In our simulations, we have always factored 
in the overhead of the extra transistors, which result from the run-time resizing 
hardware. 

Fig. 7. Adaptive CAM array delay values 



In this section, we begin to address the following issues: (a) What are some of 
the alternate algorithms one may use in implementing the decision logic referred 
to earlier (see section 3.1)? That is, how (and at what cycle windows) does one 
decide whether to size up or down? (b) What are the scenarios under which one 
scheme may win over another? (c) How does a simple naive resizing algorithm 
perform from a performance and energy perspective, in the context of a given 
workload? 

The issue unit (in conjunction with the upstream fetch/decode stages) can be 
thought of as a producer. It feeds the subsequent execution unit(s) which act as 
consumer(s). Assuming, for the moment, a fixed (uninterrupted) fetch/decode 
bandwidth, the issue queue will tend to fill up when the issue logic is unable 
to sustain a matching issue bandwidth. This could happen because: (a) the 
program dependency characteristics are such that the average number of ready 
instructions detected each cycle is less than the fetch bandwidth seen by the receiving 
end of the issue queue; or, (b) the execution pipe backend (the consumer) 
experiences frequent stall conditions (unrelated to register data dependencies), 
causing issue slot holes. This latter condition (b) could happen due to exception 
conditions (e.g., data normalization factors in floating point execution pipes, or 
address conflicts of various flavors in load/store processing, etc.). On the other 
hand, the issue-active part of the queue will tend to be small (around a value 
equal to the fetch bandwidth or less) if the consuming issue-execute process is 
faster than or equal to the producing process. Obviously, this would happen 
during stretches of execution when the execution pipe stalls are minimal and 
the issue bandwidth is maximal, as plenty of ready instructions are available 
for issue each cycle. However, one may need a large issue queue window just to 
ensure that enough ready instructions are available to maximize the issue band- 
width. On the other hand, if the stretch of execution involves a long sequence 
of relatively independent operations, one may not need a large issue queue. So, 
it should be clear that even for this trivial case, where we assume an uninterrupted 
flow of valid instructions into the issue queue, the decision to resize the 
queue (and in the right direction: up or down) can be complicated. This is true 
even if the consideration is limited only to CPI performance, i.e., if the objective 
is to always have just enough issue queue size to meet the execution needs and 
dependency characteristics of the variable workload. If the emphasis is more on 
power reduction, then one can perhaps get by with a naive heuristic for size 
adaptation, provided the simulations validate that the average IPC loss across 
workloads of interest is within acceptable limits. 

To illustrate the basic tradeoff issues, first, we provide data that shows the 
variation of CPI with integer issue queue size across several SPEC2000 integer 
benchmarks (see Figure 8). We used SimpleScalar-3.0 [15] to simulate an aggressive 
8-way superscalar out-of-order processor. The simulator uses separate 
integer and floating point queues. The simulation parameters are summarized in 
Table 2. The data in Figure 8 shows that for most of the benchmarks simulated, 
there is considerable variation in CPI as integer issue queue size varies between 8 
and 32. In order to gain insight into the potential of our adaptive issue queue, we 
implemented the algorithm discussed in Section 3.1 in SimpleScalar. We chose 
a cycle window size of 8K cycles, as this provided the best energy/performance 
tradeoff compared with the other cycle windows that we analyzed. We ran each 
benchmark for the first 400 million instructions. 

Fig. 8. CPI sensitivity to issue queue size 

Our dynamic algorithm picks the appropriate size for the next cycle window 
by estimating the usage in the last cycle window, and comparing this value 
with certain threshold values. The algorithm also compares the IPC of the last 
interval with the present interval IPC. For this purpose, we also analyzed the 
configurations with different factor values. Threshold values are adjusted such 
that, if the issue queue utilization for a certain size is at the border value of its 
maximum size (e.g., for an issue queue size of 8 entries, the border is 7 entries) 
then the issue queue size is ramped up to the next larger size. Figure 9 shows what 
percentage of the time each queue size was used with the dynamic algorithm with 
factor set to be 0.7. Table 3 shows the energy savings and CPI degradation for 
each benchmark as well as the overall average with different factor values. These 
different factor values represent different energy/performance tradeoff points. 
To estimate the energy savings, we assumed an energy variation profile which is 
essentially linear in the number of entries, based on the circuit-level simulation 
data reported earlier in Figure 6. We also take into account the shutdown logic 
overhead. CPI degradation and energy savings are both relative to a fixed 32-entry 
integer issue queue. 

Table 2. SimpleScalar simulator parameters 

Branch predictor                 comb. of bimodal and 2-level GAg 
Fetch and Decode Width           16 instructions 
Issue Width                      8 
Integer ALU/Multiplier           4/4 
Floating Point ALU/Multiplier    2/2 
Memory Ports                     4 
L1 Icache, Dcache                64KB 2-way 
L2 unified cache                 2MB 4-way 

Fig. 9. Percentage of utilization for each queue size with the dynamic adaptation 

Table 3. Energy savings and CPI degradation with different factor values 

factor                       bzip   gcc    mcf    parser  vortex  vpr    average 
0.9     CPI degradation %    0.0    1.6    0.3    0.9     3.3     0.0    1.0 
        Energy Savings %     14.0   43.3   -3.0   6.4     43.6    5.1    18.2 
0.8     CPI degradation %    0.0    4.3    0.3    1.0     4.8     0.0    1.7 
        Energy Savings %     15.9   53.9   -3.0   8.9     51.2    14.8   23.6 
0.7     CPI degradation %    0.0    8.6    0.3    1.7     6.8     1.4    3.1 
        Energy Savings %     26.6   61.3   -3.0   13.7    58.1    33.6   31.7 
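
The energy estimate described above can be sketched as a residency-weighted sum. This is a minimal illustration and not the evaluation code used for the results: it assumes that energy scales as size/32 (a strictly linear profile through the origin, which agrees with the 75% CAM savings at 8 entries), treats the shutdown logic overhead as a constant fraction, and uses the hypothetical name estimated_savings.

```python
def estimated_savings(residency, full_size=32, overhead=0.0):
    """Estimate energy savings relative to a fixed full-size queue.

    residency: mapping of queue size -> fraction of cycle windows spent
               at that size (fractions must sum to 1).
    overhead:  shutdown logic energy as a fraction of the full-size energy.
    """
    if abs(sum(residency.values()) - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    relative_energy = sum(fraction * size / full_size
                          for size, fraction in residency.items())
    return 1.0 - relative_energy - overhead


# A queue that never leaves 32 entries only pays the overhead.
print(estimated_savings({32: 1.0}, overhead=0.03))
```

With the queue held at 32 entries throughout and the 3% overhead reported in Section 3.3, the sketch yields a 3% energy loss, which is in line with the mcf column of Table 3.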

The results from Figure 9 demonstrate the broad range of workload vari- 
ability. For mcf, the full 32 entry queue is used throughout its entire execution 
whereas for vortex and gcc, only 8 and 16 entries are largely used. For bzip, 
the algorithm almost equally chooses issue queue sizes of 32, 24, and 16 entries. 
For parser, the 32 and 24 entry issue queue configurations dominate whereas 
for vpr, 24 or 16 entries are largely used. On average, this very naive algorithm 
provides a 32% decrease in the issue queue energy (61% maximum) with a CPI 
degradation of just over 3% with factor set to be 0.7. 
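
The averages quoted here can be reproduced from the per-benchmark entries of Table 3. The sketch below simply recomputes the arithmetic mean over the six benchmarks (bzip, gcc, mcf, parser, vortex, vpr); the names TABLE_3 and mean are illustrative.

```python
# Per-benchmark values copied from Table 3, in the order
# bzip, gcc, mcf, parser, vortex, vpr.
TABLE_3 = {
    0.9: {"cpi": [0.0, 1.6, 0.3, 0.9, 3.3, 0.0],
          "energy": [14.0, 43.3, -3.0, 6.4, 43.6, 5.1]},
    0.8: {"cpi": [0.0, 4.3, 0.3, 1.0, 4.8, 0.0],
          "energy": [15.9, 53.9, -3.0, 8.9, 51.2, 14.8]},
    0.7: {"cpi": [0.0, 8.6, 0.3, 1.7, 6.8, 1.4],
          "energy": [26.6, 61.3, -3.0, 13.7, 58.1, 33.6]},
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


for factor in sorted(TABLE_3, reverse=True):
    row = TABLE_3[factor]
    print(f"factor {factor}: CPI degradation {mean(row['cpi']):.1f}%, "
          f"energy savings {mean(row['energy']):.1f}%")
```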



5 Conclusion 

We examine the power saving potential in the design of an adaptive, out-of- 
order issue queue structure. We propose an implementation that divides the 
issue queue into separate chunks, connected via transmission gates. These gates 
are controlled by signals which determine whether a particular chunk is to be 
disabled to reduce the effective queue size. The queue size control signals are 
derived from counters that keep track of the active state of each queue entry 
on a cycle-by-cycle basis. After a (programmable) cycle window, the decision to 
resize the queue can be made based on the activity profile monitored. The major 
contribution of this work is a detailed, circuit-level implementation backed by 
(AS/X) simulation-based analysis to quantify the net power savings that can be 
achieved by various levels of queue size reduction. We also simulated a dynamic 
adaptation algorithm to illustrate the scenarios where the resizing logic would 
size the queue up or down, depending on the particular priorities of performance 
and energy. 

Future work includes exploring alternate hardware algorithms for queue-size 
adaptation, pursuing improvements at the circuit level that provide better con- 
figuration flexibility, and investigating methods for exploiting the self-timed issue 
queue capability. 

Abstract. In this paper we propose several system-level transformations 
that reduce the dynamic memory requirements of complex 
real-time multi-media systems. We demonstrate these transformations 
on the protocol layer of the MPEG4 IM1-player. As a consequence, up 
to 20% of the global power consumption of the protocol subsystem can 
be eliminated, which is significant due to the programmable processor 
target. The entire MPEG4 description is assumed to be mapped on a 
heterogeneous platform combining several software processors and hardware 
accelerators. 



1 Motivation and Context 

Rapid evolution in sub-micron process technology allows ever more complex sys- 
tems to be integrated on one single chip. However, design technologies fall behind 
these advances in processing technology. A consistent system design technology 
that can cope with such complexity and with the ever shortening time-to-market 
requirements is in great need. It should allow these applications to be mapped 
cost-efficiently to the target architecture while meeting all real-time and other 
constraints. The design of these complex systems starts with the specification 
of all its requirements. This is mostly done by a very heterogeneous group of 
people originating from different communities. They often formalize the specifi- 
cations into an international standard. Normally, they simultaneously build an 
executable model of the specified system without investing too much effort in 
the efficiency of the implementation. Afterwards, this executable model is then 
translated by another group of designers into a working system. They manually 
try to improve the initial implementation. As a result of the large complexity and 
size of these systems, this mostly happens in a very ad hoc and time-consuming 
way. However, rapidly changing features and standards force the design trajectory 
to be shorter. Hence, less time is left to improve the initial specification, 
resulting in suboptimal, power hungry and not very retargetable systems. 

As a consequence, one can conclude that there is a need for a formalized 
methodology that allows us to systematically improve the initial task-level spec- 
ification. Therefore, we propose a methodology that starts from an initial, unified 
specification model that is effectively capable of representing the right system- 
level abstraction [16,7]. We then systematically transform this model in order 
to reduce the cost of dynamically allocated data and to increase the amount of 
concurrency available in the application. We mainly target the protocol layer 
of multi-media and network applications which are typically mapped to power- 
hungry programmable processors. The high throughput but more regular and 
mostly fixed data processing kernels (e.g. IDCT in video-decoders) are assumed 
to be mapped to accelerators based on more custom processors. This model can
then finally be used to map the improved specification on a ‘platform’. In this 
paper we will focus on the code transformation stage oriented to the memory 
subsystem. That subsystem contributes heavily to the overall power and area 
cost. The effect of these transformations on the dynamic memory requirements 
of the system will be demonstrated, when mapping it on a heterogeneous low 
power platform that combines parallel software processors and hardware accel- 
erators. By reducing the amount of dynamically allocated memory needed by 
the applications with our transformations, either the quality of service of the 
system can be improved or the power consumption and the area (size) of the 
system can be reduced. 

2 Target Application 

The target applications of our task-level system synthesis approach are advanced 
real-time multi-media systems. These applications involve a combination of
complex data- and control-flow where complex data types are manipulated and
transferred. First, most of these applications are implemented on portable devices, putting
stringent constraints on the degree of integration and on their power consump- 
tion. Secondly, these systems are extremely heterogeneous in nature and combine 
high performance data processing (e.g. data processing on transmission data in- 
put) as well as slow rate control processing (e.g. system control functions), syn- 
chronous as well as asynchronous parts, analog and digital, and so on. Thirdly, 
time-to-market has become a critical factor in the design phase. Finally, these 
systems are subjected to stringent real-time constraints (both hard and soft 
deadlines are present), complicating their implementation considerably. 

The main driver for our research is the IM1-player [8]. This player is based on
the MPEG4 standard, which specifies a system for the communication of
interactive audio-visual scenes composed of objects. The player can be partitioned
into four layers (see Fig. 1): the delivery layer, the synchronization layer, the
compression layer and the composition layer. The delivery layer is a generic
mechanism that conveys streaming data to the player. Input streams flow through
the different layers and are gradually interpreted. We mainly focus on the
compression layer. This layer contains several decoders. Two important ones are
the BIFS^1 and the OD^2 decoder. The BIFS and OD decoders interpret the data
packets that describe the composition of the scene objects. The output of these
decoders is a graph. Each node in this graph represents an object in the scene.
The edges represent the relations between the different objects in the scene.
However, to represent some of the scene objects extra AV-data is needed. This
data is conveyed in other streams to the compression layer, where it is
interpreted by the appropriate decoders (e.g. a JPEG-stream is uncompressed by a
JPEG-decoder). In this paper we assume that an arbitrary number (x) of
wavelet-encoded images can be transmitted to the compression layer. The output of
these decoders is finally rendered on the screen of the embedded device by the
composition layer. In order to increase the throughput, the system is pipelined
with large, dynamically allocated buffers as indicated in Fig. 1. These buffers
are the main focus of this paper.

^1 Binary Format for Scenes

Fig. 1. System level modules of the MPEG4 IM1-player



3 Related Work 

By transforming the memory behavior of the initial specification, we aim to
obtain an improved embedded system. In the domain of memory management, it is
known that the time and space requirements can be reduced by improving the
collaboration between the application and the memory manager. Therefore, much
effort has been spent on measuring the behavior of memory management [22,23]
and then developing improved memory managers both in software [24] and hardware
[27,21], using the profiling information to reduce the memory space and to meet
the real-time constraints. In most cases, however, they did not consider changing
the behavior of the application. Hence, we believe that our transformations are
complementary to the existing literature.

In Matisse [25,2], a design-flow for telecommunication systems, we have proposed
a two step dynamic memory management (DMM) approach to improve the dynamic
memory requirements of the original system. First, based on information extracted
by profiling the application with a relevant input set, the data types are refined
and their sizes are estimated. Afterwards, several memory managers are tried out
in order to explore different power and access count related cost functions. In
this part of our design-flow, we are thus also changing the mutator. However, we
did not consider modifying the control/data-flow graph (CDFG) of the application,
and up to now we have focussed on the table access oriented network applications
in [2] rather than on multi-media applications that are loop and array based.

^2 Object Descriptor

It has been recognized quite early in compiler theory (for an overview see [5])
and in high-level synthesis that transformations must be performed before the
memory organization related tasks. Otherwise, the memory organization and the
data transfer and storage (DTS) cost will be heavily suboptimal. Therefore, we
have proposed formal transformation strategies for power and storage size
reduction in [4,3,26]. The main focus in this previous work is static data flow
and loop transformations to increase the amount of parallelism, to improve the
data locality and to decrease the amount of buffering needed. In our current
design-flow, these transformations are executed after the DMM step, when all the
storage sizes can be determined statically. Hence, the influence of dynamically
allocated data is not taken into account. We believe that, on the one hand, some
of the known data-flow transformations can also be applied to dynamically
allocated data, e.g. shifting of “delay-lines” and re-computation issues (see
[26]). On the other hand, as a result of the non-determinism implied by the use
of dynamically allocated memory, we can identify some new and complementary
transformations that improve the DTS cost. These are the main topic of this
paper.

4 The Grey-Box Abstraction Level 

Because of the complexity and the size of the applications in our target domain,
we have proposed to look at a new abstraction level to model these embedded
systems, with emphasis on performance and timing aspects while minimizing the
memory and processor cost. In contrast to existing task-level approaches, we work
neither at the detailed white-box task model (e.g. [17,19,18]), where too much
information is present to allow a thorough exploration, nor at the black-box task
model (e.g. [10,11,12,13,15,9]), where insufficient information is available to
accurately steer even the most crucial cost trade-offs.

In [6,16] we have concluded that this new abstraction level only needs to
capture essential data types, control/data-flow and dynamic constructs. On this
grey-box model, we can apply transformations for both the dynamic memory
management step and the task concurrency management step [6,1] in our design-flow
[14]. In this paper, we will mainly demonstrate several system-level
transformations focussed on dynamic memory (DM) improvement in the IM1-player.

5 Reducing the Dynamic Memory Requirements

When developing an initial model of a system, specification engineers normally
spend little time or effort on using dynamic memory efficiently. They are
mainly concerned with the functionality and the correctness of the system. As a
result, the software model is often very memory inefficient. For the design
engineers, improving the dynamic memory usage afterwards becomes an error-prone
and almost unmanageable process because of dangling pointers and memory leaks.
Therefore, they will either redesign the system from scratch or use the same
system but increase the memory, which results in a more power-hungry
architecture. In 5.1 we will present two formalizable transformations that allow
us to systematically reduce the amount of dynamic memory required by the
application. The benefit of these transformations can be exploited on different
multi-processor platforms. Therefore, we will discuss in 5.2 the influence of the
platform on the DM requirements of an application and we will explain how these
can be relaxed.



5.1 Control/Data-Flow Transformations Enabling a Reduction of 
the Dynamic Memory Requirements 

With the first transformation we reduce the life-time of the dynamically
allocated memory by allocating the memory as late as possible and deallocating it
as soon as possible. This increases the storage rotation, which can significantly
reduce the amount of memory required by the application. Indeed, in most
applications memory is allocated in advance and only used much later. This design
principle is applied to obtain flexible and well structured code which is easy to
reuse. Although this can be a good strategy to specify systems on a powerful
general purpose platform, the extra memory cost is not acceptable for
cost-sensitive embedded systems. The same allocation policy could also be
implemented with a garbage collector. However, transforming the source code
instead of relying on a run-time garbage collector results in a lower power
consumption. The transformation acts as an enabling step for the second
transformation and the task-ordering step discussed below.
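The effect of shortening buffer life-times can be sketched with a small peak-memory calculation. The buffer sizes and time-instants below are purely illustrative assumptions and are not taken from the IM1-player profile.

```python
from typing import List, Tuple

# Each buffer is (alloc_time, free_time, size); all values are illustrative.
Buffer = Tuple[int, int, int]


def peak_memory(buffers: List[Buffer]) -> int:
    """Return the largest amount of memory that is live at the same time."""
    events = []
    for alloc, free, size in buffers:
        events.append((alloc, size))   # memory becomes live
        events.append((free, -size))   # memory is released
    # Sort by time; at equal times the releases (negative) come first.
    events.sort(key=lambda e: (e[0], e[1]))
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


# Original style: three buffers allocated up front and freed at the end,
# although they are used one after the other.
original = [(0, 30, 100), (0, 30, 100), (0, 30, 100)]
# Transformed style: each buffer lives only from its first to its last use.
transformed = [(0, 10, 100), (10, 20, 100), (20, 30, 100)]

print(peak_memory(original))     # 300
print(peak_memory(transformed))  # 100
```

In this toy case the same three buffers need one third of the memory once their life-times no longer overlap.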

By postponing the allocation we are able to gather more accurate information
on the precise amount of memory needed. This information can be used to optimize
the amount of memory allocated for each data type. The rationale behind this
second transformation is that specification engineers often allocate
conservatively large amounts of memory. In this way, they try to avoid cumbersome
and difficult debugging. They forget that it is much harder for implementation
engineers to re-specify the amount of memory afterwards, as they do not have the
same algorithmic background. For the same reason, automation of this
transformation is particularly hard.
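A minimal sketch of the difference between worst-case and data dependent allocation is given below. The bound `WORST_CASE_SIZE` and the announced sizes are hypothetical values chosen for illustration only.

```python
# Hypothetical design-time bound used by a conservative specification (bytes).
WORST_CASE_SIZE = 64 * 1024


def worst_case_footprint(declared_sizes):
    """Memory reserved when every buffer gets the worst-case size."""
    return WORST_CASE_SIZE * len(declared_sizes)


def data_dependent_footprint(declared_sizes):
    """Memory reserved when each buffer gets the size announced for its data."""
    for size in declared_sizes:
        if not 0 <= size <= WORST_CASE_SIZE:
            raise ValueError(f"declared size {size} outside the design-time bound")
    return sum(declared_sizes)


# Hypothetical sizes announced for three incoming objects (bytes).
sizes = [12 * 1024, 8 * 1024, 20 * 1024]
print(worst_case_footprint(sizes))      # 196608
print(data_dependent_footprint(sizes))  # 40960
```

The check against the bound keeps the data dependent variant as safe as the conservative one while reserving only what the data actually needs.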

We have applied both transformations on the dynamically allocated buffers 
in the IM1-player (see Fig. 1). These buffers are used to store the input and
output of the compression layer. The size of the input buffers is transmitted in 
the OD data packets; the size of the output buffers is fixed at design-time and 
differs for each decoder type. 

We have profiled the time-instants at which each buffer is (de)allocated and
is first (last) used. All the input buffers are allocated and freed at the same
time. However, on the original one-processor platform, the buffers are used
sequentially. As a consequence, by reducing the life-time of these buffers it is
possible to overlap them and significantly decrease the amount of memory needed.
This principle is illustrated in the left part of Fig. 2.





[Figure: memory usage versus time. Panels: memory usage of the original
implementation of the dynamic data structures; after decreasing life-time and
reordering of the dynamic data structures; after allocating data dependent size.]

Fig. 2. DM oriented transformations illustrated on the input buffers of the IM1-player



By analyzing the accesses to the allocated memory, we have found that a large
part of the dynamically allocated output buffers is never accessed. By applying
our second transformation, we are able to reduce the output buffer size by making
it data dependent. Instead of worst case sizes, the data dependent sizes actually
required by the input stream can now be used (see right part of Fig. 2).

5.2 Constraints on Static and Dynamic Task-Schedule Reducing
the Dynamic Memory Cost on a Multi-processor Platform 

After improving the characteristics of the application with the transformations 
mentioned above, its tasks still need to be assigned and scheduled on a ‘platform’.
By constraining the task-schedule of the application (e.g. scheduling the task that 
requires the largest amount of memory first) it is possible to avoid consecutive 
(de)allocations of the same memory (see [20]). In this way, not only some memory 
accesses can be avoided (and as a result power saved), but also the fragmentation 
of the memory can be reduced. 

In addition, when exploiting the parallelism inside the application, the
amount of memory needed by the application will normally increase proportionally
to the number of processors. This relation can be improved by preventing two
tasks that both use a large amount of memory from being executed simultaneously.
Our transformations applied prior to this step will relax the constraints on the
task-schedule.

We will illustrate this with the same example as used in subsection 5.1.
Initially, it is useless to reorder the different tasks to decrease the DM cost
of the input buffers: all buffers have the same life-time, so no gain can be made
by reordering. After applying the enabling transformations discussed in
subsection 5.1, we obtain more freedom to reorder (or constrain) the
task-schedule to decrease the DM cost. For example, we can avoid running the
wavelet decoders that require much memory simultaneously. As a consequence, the
total amount of required memory increases less than proportionally to the amount
of parallelism (see Fig. 3).
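The schedule constraint can be sketched as follows. The example assumes two parallel units, tasks of equal duration, and hypothetical memory requirements; the pairing heuristic is one simple way to express the constraint, not necessarily the scheduler used in the experiments.

```python
def peak_memory(slots):
    """Peak memory when the tasks of one slot run in parallel and the slots
    run one after the other."""
    return max(sum(slot) for slot in slots)


def constrained_schedule(task_memory):
    """Pair the most memory-hungry remaining task with the least hungry one,
    so that large tasks share a slot with small ones where possible."""
    ordered = sorted(task_memory)
    slots = []
    while ordered:
        big = ordered.pop()                       # largest remaining requirement
        if ordered:
            slots.append((big, ordered.pop(0)))   # pair it with the smallest
        else:
            slots.append((big,))
    return slots


# Hypothetical memory requirements (kB) of four decoding tasks.
task_memory = [70, 70, 10, 10]
unconstrained = [(70, 70), (10, 10)]   # both large tasks run at the same time
constrained = constrained_schedule(task_memory)

print(peak_memory(unconstrained))  # 140
print(peak_memory(constrained))    # 80
```

Both schedules use the same two slots, so the throughput is unchanged while the peak memory drops.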



[Figure: memory cost and task schedule, without and with constraints on the
schedule. Panels: assuming a two processor platform with one hardware
accelerator; assuming a two processor platform with two hardware accelerators.]

Fig. 3. Constraints on the task-schedule reducing the DM requirements of a system



6 Experiments and Results 

In this section we will demonstrate quantitatively the results of our
transformation methodology. In a first part we will present the benefit of our
transformations on the input buffers. In a second part we will briefly explain
the potential benefit of our methodology on the output buffers.

We will schedule the IM1-player on a representative platform consisting of
two software processors with extra hardware accelerators added to decode the
AV-data. The system layer of the player consists of three tasks running on the
software processors, i.e. the delivery layer, the BIFS decoder and the OD
decoder. Two parallel SA110 processors, both running at 1V, are used to
illustrate the scheduling. The main constraint for scheduling is the time-budget,
i.e. 30ms to render one frame. The two SA110 processors consume 40mW each. The
energy consumed by the hardware accelerators is estimated based on realistic
numbers that we obtained by measuring a wavelet decoding oriented chip that we
have developed at IMEC. The cost of the memory is calculated based on recent
memory models supplied by vendors (Alcatel 0.35µm DRAM for the embedded memory
and Siemens Infineon DRAM for the external memory).

In a first experiment we have scheduled the IM1-player using only one
hardware accelerator. Afterwards, we have added an extra one to improve the
performance of the system and decode more objects per frame. We have compared all
our results with the performance of the initial specification mapped on the same
platform.

The results in Tab. 1 and Tab. 2 clearly show the potential benefit of the
transformations of subsection 5.1, as they allow us to reduce the global power
cost (including both processor and memory cost) by almost 20%. Moreover, this
methodology can still be applied to other parts of the player and does not need
to be restricted to the input buffers only. We have obtained this power gain
because we are able to significantly reduce the amount of memory required by the
application. As a consequence, the data can be stored in embedded DRAM instead of
off-chip DRAM. We believe that two reasons make our approach scalable to future
applications. First, even if embedded DRAM becomes available that is large enough
to contain the initial data structures, our transformations will still be needed
because the power consumption of embedded DRAM scales at least
super-logarithmically with its size. Second, we believe that the amount of data
required by the applications will increase at least proportionally to the size of
embedded DRAM.

As indicated in subsection 5.2, the buffer size would normally increase pro- 
portionally with the amount of parallelism. By constraining the schedule this 
can be avoided. The results for a two hardware accelerator context are repre- 
sented in Tab. 2. One can also derive from Tab. 1 and Tab. 2 that the fraction 
of the memory cost increases in the total energy budget when more parallelism 
is exploited in the system. This illustrates the importance of these constraints 
on highly parallel platforms. 

In this final paragraph we present the effect of our transformations on the
output buffers. We are able to significantly reduce the size of these buffers,
i.e. from 750kB to 105kB. Consequently, the power cost of the memory accesses is
further reduced, which allows us to embed the IM1 protocol on a portable device.
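The global power figures quoted in this section can be cross-checked against the totals of Tab. 1 and Tab. 2. The sketch below assumes that the global cost per frame is simply the sum of the total processor energy and the total memory energy; that summation is our reading of the tables and is not stated explicitly in the text.

```python
# Per-frame energy totals in mJ, copied from Tab. 1 (processor) and Tab. 2 (memory).
PROCESSOR = {"1 HW": 2.58, "2 HW": 4.0}
MEMORY_PRE = {"1 HW": 0.78, "2 HW": 1.54}
MEMORY_POST = {"1 HW": 0.16, "2 HW": 0.19}


def global_reduction(config):
    """Relative reduction of processor plus memory energy after the transformations."""
    pre = PROCESSOR[config] + MEMORY_PRE[config]
    post = PROCESSOR[config] + MEMORY_POST[config]
    return (pre - post) / pre


def memory_share_before(config):
    """Fraction of the untransformed energy budget spent in memory."""
    return MEMORY_PRE[config] / (PROCESSOR[config] + MEMORY_PRE[config])


for config in PROCESSOR:
    print(config,
          f"reduction {100 * global_reduction(config):.1f}%",
          f"memory share before {100 * memory_share_before(config):.1f}%")
```

Under this assumption the reduction is roughly 18% with one accelerator and roughly 24% with two, and the memory share of the untransformed budget grows from about 23% to about 28% when the second accelerator is added.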



7 Conclusion 

In this paper we have presented several systematic data/control flow transfor- 
mations for advanced real-time multi-media applications which reduce the power 
(area) cost of the dynamically allocated memory or increase the performance of 
the system. The first transformation has reduced the life-time of the dynamically 
allocated data structures. As a result, the memory rotation could be significantly 
increased which reduces the required amount of memory. It also functioned as 
an enabling step for the second transformation, in which we have defined more 
accurately the amount of memory needed for each data structure. Finally, we 
have given some constraints which help to relax the effect of parallel execution of 
tasks on the increasing buffer cost. We have illustrated these techniques with
examples extracted from a real multi-media application, the MPEG4 IM1-player.
They allowed us to reduce the energy consumption of the memories by at least a
factor of 5. As a consequence, the global power cost of a significant part of
this particular application has decreased by 20% without even touching the other
subsystems.

Table 1. Execution times and processing energy of most important tasks in
IM1-player on a heterogeneous platform with one and two hardware accelerators

            1 HW accelerator           2 HW accelerators
            Execution   Processor      Execution   Processor
            Time        Energy         Time        Energy
OD          9ms         0.37mJ         9ms         0.37mJ
BIFS        24ms        0.98mJ         24ms        0.98mJ
Delivery    8.1ms       0.32mJ         16.2ms      0.64mJ
Wavelet     30ms        1mJ            30ms        2mJ
Total       30ms        2.58mJ         30ms        4mJ

Table 2. Memory requirements and memory energy of most important tasks in
IM1-player on a heterogeneous platform with one and two hardware accelerators

            1 HW accelerator                   2 HW accelerators
            Mem.       Mem.       Mem.         Mem.       Mem.       Mem.
            Accesses   Size Pre   Size Post    Accesses   Size Pre   Size Post
OD          0.58k      10kB       0.58kB       0.58k      10kB       0.58kB
BIFS        2.41k      41kB       2.41kB       2.41k      41kB       2.41kB
Delivery    35.9k      35.9kB     12.4kB       71.8k      71.8kB     17kB
Wavelet     35.9k      35.9kB     12.4kB       71.8k      71.8kB     17kB
Total       74.9k      86.9kB     14.8kB       146k       193kB      19.4kB

            1 HW accelerator           2 HW accelerators
            Energy Pre  Energy Post    Energy Pre  Energy Post
Memory      0.78mJ      0.16mJ         1.54mJ      0.19mJ

Abstract. Real-time 3D graphics will be a major power consumer in future
portable embedded systems. 3D graphics is characterized by intensive
floating point calculations, heavy data traffic between memory and
peripherals, and a high degree of non-uniformity due to content varia- 
tion. Although 3D graphics is currently very limited in power-constrained 
environments, future user interfaces (head-mounted virtual reality) and 
applications (simulators, collaboration environments, etc.) will require 
power-aware implementations. Fortunately, we can exploit content vari- 
ation and human perception to significantly reduce the power consump- 
tion of many aspects of 3D graphic rendering. In this paper we study 
the impact on power consumption of novel adaptive shading algorithms 
(both Gouraud and Phong) which consider both the graphics content 
(e.g. motion, scene change) and the perception of the user. Novel dynami- 
cally configurable architectures are proposed to efficiently implement the 
adaptive algorithms in power-aware systems with gracefully degradable 
quality. 

This paper introduces an integrated algorithm and hardware solution
based on human visual perceptual characteristics and dynamically
reconfigurable hardware. Three approaches are explored, all of which are
based on human vision and loosely analogous to video coding techniques.

The first approach is called distributed computation over frames and
exploits the after image phenomenon of the human visual system. The
second approach exploits visual sensitivity to motion. According to the
speed of the object and its distance from the camera, either the Gouraud or
the Phong shading algorithm is selected. The third approach is an adaptive
computation of the specular term used in Phong shading. Using the
same selection criteria as in adaptive shading, a reduced computational
cost algorithm is used for fast moving objects. Results based on
simulation indicate power savings of up to 85% using short but realistic
rendering sequences. Future work includes: 1) more sophisticated
architectures to support dynamic reconfiguration, 2) exploring other steps in
the 3D graphics pipeline, and 3) extending these ideas to other multimedia
applications which involve variable content, computation and human
perception (for example, video and audio coding).

* This work has been partially supported by the National Science Foundation under
Award CCR-99882388.

1 Introduction

3D graphics hardware has been receiving great attention recently [27] due to 
the emergence of 3D graphics in many applications, such as movie making [11], 
virtual reality (VRML) [8], 3D games, and others. Two main architectural ap- 
proaches are typically considered for 3D graphics computing platforms: 1) pow- 
erful parallel programmable devices and 2) custom silicon. However, the migra- 
tion of complex graphics systems into personal and portable products such as 
wearable computers, palm and laptop computers, and 3D graphics heads-up dis- 
plays will require an increased emphasis on low-power design, much as it has in 
portable video and audio systems. 

For 3D graphics, programmable devices give flexibility, but have higher cost
and lower energy efficiency than custom silicon. Since real-time computer
graphics requires both powerful floating point and fixed point arithmetic units, fast
memories and a high speed data path between computing units and memory, 
it is generally supported by a special purpose co-processor board with various 
application-specific chip designs [25,22,24,20,6,23]. Custom silicon ASICs limit 
the graphics system by not exploiting the significant amount of variation in 
computational load across a range of graphics tasks due to the complexity of the 
scene, level of detail, motion, lighting and photorealism effects. By introducing 
dynamically reconfigurable hardware to a 3D graphics engine, the gap between 
the two conventional approaches can be narrowed, and performance and power
constraints can be met by trading off rendering quality.

Human visual perceptual characteristics have been used widely in video and
graphics research. In the movie industry, film uses only 24 frames/second
without a frame by frame strobing effect. This low frame rate can be achieved by
exploiting the after image phenomenon of the human visual system, double flashing
of each frame and a dark room environment. The duration of the after image,
however, varies over a wide range depending on the light intensity of the
environment and of the object being watched by the observer. Video coding has
used human visual perception in a number of ways to reduce the amount of video
data, such as color perception and spatial-temporal frequency perception. Some
research groups studying 3D graphics rendering have tried to produce more
realistic images with special effect algorithms, such as motion blur [28], depth
of field [37], visual adaptation [32] and others. Motion blur and depth of field
algorithms are used to mimic a camera system. The visual adaptation algorithm is
used to mimic the sensitivity of the human visual system to quick changes of
scene intensity. This paper notes that human visual perceptual characteristics can be
used to reduce the amount of computation in 3D rendering while maintaining 
a reasonable quality of image. The reduction of computation is achieved by ex- 
ploiting both the graphics content, such as motion and scene change, and the 
visual perception of the viewer, such as the low sensitivity to moving objects 
and the after image phenomenon. 

In this paper, we propose three dynamically adaptive approaches to 3D object
shading algorithms: Gouraud shading [14] and the higher quality Phong shading
[33]. Shading consumes a substantial part of the rendering computation and
involves many perceptual issues, and is thus an appropriate area to look for
significant power savings. The first approach exploits the after image phenomenon
of the human visual system, which lasts about 1/8 second (4 frames in a
30 frames/sec image sequence). Due to the slow response of the visual system,
some of the terms in the shading computation are invariant over a number of
frames and can be stored or looked up in a table rather than computed for every
frame. The second approach exploits the fact that human visual perception has
relatively low sensitivity to the rendered quality of moving objects. Since Phong
shading requires much more computation but gives higher quality than Gouraud
shading, one of these two shading methods can be chosen according to the speed of
the object being rendered. Another criterion for choosing the shading method is
the distance between the object and the viewer. The third approach applies
different algorithms for the specular term of Phong shading. Since the specular
term uses a weighting coefficient cos^n(α), where α is the angle between the
viewing direction and the specular direction and n is the specularity of the
material, the computational cost grows at least logarithmically in n.
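The logarithmic lower bound on the cost of the specular weighting can be illustrated with exponentiation by repeated squaring. This is a generic technique used here as an assumption for illustration; the text does not state which exponentiation scheme the authors have in mind.

```python
import math


def pow_by_squaring(base: float, n: int):
    """Return (base**n, number_of_multiplications) for an integer n >= 0."""
    result = 1.0
    multiplications = 0
    while n > 0:
        if n & 1:                 # current bit set: fold base into the result
            result *= base
            multiplications += 1
        n >>= 1
        if n:                     # more bits to come: square the base
            base *= base
            multiplications += 1
    return result, multiplications


alpha = 0.3                       # illustrative angle in radians
value, count = pow_by_squaring(math.cos(alpha), 64)
print(count)                      # 7
```

For a specularity of n = 64 the sketch needs 7 multiplications, where a naive product would need 63. The count still grows with the number of bits of n, which matches the logarithmic growth mentioned above.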

2 Background and Related Works 

2.1 3D Graphics Rendering System 

A typical 3D computer graphics pipeline is depicted in Fig. 1 [13]. The first
step, the application, depends on the host CPU computing power. The second step,
the geometric engine, does intensive floating point arithmetic, thus it needs
multiple pipelined floating point arithmetic units and/or specific arithmetic
function units, such as a floating point 4x4 matrix multiplier. The final step,
the rasterization engine, requires many high-speed fixed point arithmetic units
and a costly high speed dedicated memory system. Excluding the application step,
the time cost of the remaining two steps can be expressed as Θ(n + a_c) [18],
where n is the number of primitives and a_c is the sum of screen areas of clipped
primitives. This expression shows that the cost is linear in the complexity of
the scene and also implies that a rendering system with a certain computing power
can only achieve a certain level of quality.




Fig. 1. 3D Graphics Pipeline 



Shading belongs to the rasterization engine and is the process of performing 
lighting computations and determining the corresponding colors of each pixel 
[29]. The lighting equation for multiple light sources is: 

$$ I = k_a I_a + \sum_{i=1}^{n} I_i \left[ k_d (N \cdot L_i) + k_s (R_i \cdot V)^{s} \right] \qquad (1) $$

where I_a is the ambient light intensity, I_i is the intensity of the ith light
source, k_a is the ambient-reflection coefficient, k_d is the diffuse-reflection
coefficient, k_s is the specular-reflection coefficient, N is the unit normal
surface vector, L_i is the unit vector directed toward the ith light source, R_i
is the specular reflection unit vector of L_i, V is the unit vector directed
toward the viewer, and s is the specular reflection parameter. The first term is
the ambient light component, the second term is the diffuse light component, and
the last term is the specular light component.
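A direct evaluation of Equation 1 can be sketched as follows. The reflection formula R = 2(N·L)N − L and the clamping of negative dot products to zero are common conventions that we assume here; neither is spelled out in the text, and all function names are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def reflect(light_dir, normal):
    """Specular reflection vector R = 2 (N . L) N - L for unit vectors."""
    d = dot(normal, light_dir)
    return tuple(2.0 * d * n - l for n, l in zip(normal, light_dir))


def phong_intensity(k_a, i_a, k_d, k_s, s, normal, view_dir, lights):
    """Evaluate Equation 1; `lights` is a sequence of (I_i, L_i) pairs."""
    total = k_a * i_a                                  # ambient term
    for i_i, light_dir in lights:
        r = reflect(light_dir, normal)
        diffuse = max(dot(normal, light_dir), 0.0)     # clamping is an assumption
        specular = max(dot(r, view_dir), 0.0) ** s
        total += i_i * (k_d * diffuse + k_s * specular)
    return total


# Light and viewer both straight above the surface: every term is at its maximum.
up = (0.0, 0.0, 1.0)
print(phong_intensity(0.1, 1.0, 0.5, 0.4, 8, up, up, [(1.0, up)]))
```

Dropping the `k_s` term from the loop gives the ambient plus diffuse form of the equation.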

Gouraud shading uses the ambient and diffuse terms of Equation 1. For each
vertex of a triangle, the Gouraud lighting equation is computed and then a
Digital Differential Analyzer (DDA) algorithm is applied to compute each pixel’s
light intensity within a triangle. The DDA algorithm consumes three subtractions,
two divisions, two multiplications and one addition.
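The incremental idea behind DDA interpolation can be sketched for a single scanline span. This is a simplified one-dimensional illustration under our own assumptions; it does not reproduce the exact operation count quoted above, which covers the full triangle setup.

```python
def dda_interpolate(i_start, i_end, x_start, x_end):
    """Intensities for the pixels x_start..x_end (inclusive), obtained by
    repeatedly adding a constant increment instead of re-evaluating lighting."""
    steps = x_end - x_start
    if steps == 0:
        return [i_start]
    delta = (i_end - i_start) / steps   # set-up: one subtraction pair, one division
    values = []
    current = i_start
    for _ in range(steps + 1):
        values.append(current)
        current += delta                # per pixel: a single addition
    return values


print(dda_interpolate(0.0, 1.0, 0, 4))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The lighting equation is evaluated only at the span end points; every pixel in between costs one addition.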

Phong shading uses the DDA algorithm to compute a normal vector N for 
each pixel and uses Equation 1 for every pixel in a triangle. With the specular 
term, Phong shading gives a more realistic image for a shiny surface object
than Gouraud shading. However, the computation cost of Phong shading is much 
higher than Gouraud shading, because of the computation of the specular reflec- 
tion unit vector Ri and the specular reflection parameter exponentiation. 
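Because of this cost difference, the adaptive approach outlined in the introduction chooses between the two shaders per object. A minimal sketch of such a selector is shown below. The thresholds, the units, and the rule that distant objects fall back to Gouraud shading are hypothetical assumptions; the text only states that object speed and distance are the selection criteria.

```python
def choose_shader(speed, distance, speed_limit=2.0, distance_limit=100.0):
    """Return "gouraud" for fast or distant objects and "phong" otherwise.

    speed_limit and distance_limit are hypothetical tuning parameters in
    arbitrary units; they are not values from the paper.
    """
    if speed > speed_limit or distance > distance_limit:
        return "gouraud"   # viewer is less sensitive: use the cheaper shader
    return "phong"         # slow and close: spend the extra computation


print(choose_shader(speed=0.5, distance=10.0))   # phong
print(choose_shader(speed=5.0, distance=10.0))   # gouraud
```

In a real system the two limits would have to be tuned against perceived quality, which is outside the scope of this sketch.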



2.2 Reconfigurable Computing 

The differences between a general purpose microprocessor and a custom silicon
solution can be described in terms of cost, design cost, flexibility, computing
speed, and energy efficiency. A microprocessor has much more flexibility, but
lower speed for a given task and lower energy efficiency than custom silicon. The
gap is large: at least two orders of magnitude for speed and energy. Recently,
reconfigurable hardware (or devices), such as Field Programmable Gate Arrays
(FPGA) [26,4,5], has provided an alternative to close that gap by offering a
compromise between flexibility and performance/efficiency. Reconfigurable
hardware can actually be superior to both custom silicon and general purpose
microprocessors for a highly regular computing task whose essential
characteristics are semi-static.

FPGAs can be classified as fine-grained programmable devices. Unfortunately,
this fine-grained programmability comes at a cost in circuit efficiency, speed and
power. It also requires a very large overhead in time, power and storage for
reconfiguration. To overcome the limitations of bit-level granularity, several research
projects [12,17,10,1] have been investigating architectures which are based on 
wider data paths. These architectures have a coarser granularity and offer less 
flexibility but perform well for computations that are well matched to this gran- 
ularity (e.g. image processing, string matching, digital signal processing). More 
information about reconfigurable computing for signal processing applications 
can be found in [38]. 3D graphics does not have significant history as an appli- 
cation of reconfigurable computing, however the basic computations, real-time 
requirements and I/O intensity are common to those of signal processing. 

Introducing dynamically reconfigurable hardware into 3D graphics hardware 
could ease the bottlenecks of steps 2 and 3 of Fig. 1, for example with a 
variable-size matrix multiplier for step 2 and a reconfigurable embedded memory 
system for step 3. Consequently, a high-end system can gain computing speed 
while keeping its flexibility, and a low-end system can gain flexibility without 
losing more than a little speed compared to a fixed hard-wired one. Flexibility 
matters a great deal in computer graphics hardware: for example, it makes it 
possible to implement an adaptive algorithm for a certain function, which could 
lead a graphics system to produce higher quality without losing computing speed. 

In addition to the flexibility, reconfigurable hardware could be a good 
candidate for 3D graphics computing by introducing hard-wired arithmetic 
function blocks. For example, some parameters used to compute the light intensity 
of a pixel can be considered constant for a certain period of time, such as the 
viewer position, the directional light source position, and the object’s surface 
parameters. The viewer position vector, for instance, does not change for at least 
one frame time (33ms for a 30 frames/sec system) and is used to compute the 
specular term of the Phong shader. With these semi-constants, 3D graphics 
computing is well suited to reconfigurable hardware. 



2.3 Related Works 

Research areas in 3D graphics related to this paper include novel algorithms to 
exploit levels of detail (LOD), effects like motion blurring, and hardware 
implementations of shading algorithms. LOD generation algorithms [31,7,15,19], 
based on the distance between object and camera, have been explored to reduce 
the rendering computation and the amount of polygonal information for 
transmission over a network and for storage. These algorithms reduce the 
rendering computation for Gouraud shading significantly. Phong shading, 
however, does not benefit as much as Gouraud shading in terms of computation 
reduction, since Phong shading’s computation is linear in the number of pixels 
occupied by an object, independent of its distance from the viewer. 

Due to the high computation cost of rendering, real-time 3D rendering is still 
very limited in achievable frame rate. With low frame rates, a strobing effect of 
moving objects is noticeable. In order to reduce this artifact, motion blurring 
techniques [16,40,28,21] have been introduced. In addition to reducing the 
artifact, motion blurring generates image sequences similar to a video sequence 
taken by a camera. This paper assumes that the rendering engine performs the 
motion blurring. 

Phong shading hardware implementation is difficult due to both computation 
and hardware cost. Thus much of the previous work on Phong shading 
implementation has focused on how to reduce the computation and the hardware 
cost. In order to reduce the hardware cost, a fixed point computation is proposed 
in [3]. Some software renderers use a Taylor series to approximate the specular 
term of Phong shading. Since a Taylor series hardware implementation requires a 
large lookup table to achieve reasonable image quality, its hardware cost is too 
high. To overcome this high hardware cost, adaptive quantization of the lookup 
table is proposed in [36]. Another work [35] tests the angle between the surface 
normal vector and the light reflection vector before the specular term 
computation to detect highlighted polygons. Since this paper proposes a way to 
turn off the specular term of Phong shading for certain image sequences, 
all three previous approaches could be used along with the proposed approach. 



3 Power Estimation and Image Quality Measurement 

3.1 Power Estimation 

In this section, a simple model is introduced to show the power savings of the 
adaptive shading system. In order to simplify the model and to minimize the 
computation difference between the two shading algorithms, the following 
assumptions are made: 

1. The same word length for all functional units 

2. Only 1 light source 

3. Specular reflection parameter s = 1 

4. For the specular term, N · H [2] is used instead of R · V, since the halfway 
vector H between L and V is the same for all pixels of an object, while R 
varies for each pixel. 

5. The adder, multiplier, and divider functional units have normalized power 
consumptions of 1, 1.5 and 4, respectively [30] (pp. 2-6). 

Fig. 2 shows the basic data flow block diagram of the Gouraud and Phong shader. 



Fig. 2. Basic Data Flow of Gouraud and Phong Shader 

The simplified power models for Gouraud and Phong shading based on these 
assumptions can be expressed by Eq. 2 and Eq. 3, respectively. 

$P_{gouraud} = P_{dda} + P_{diff} \times 3 \qquad (2)$ 

$P_{phong} = P_{dda} \times 3 + (P_{diff} + P_{spec}) \times p \qquad (3)$ 

where Pdda, Pdiff and Pspec are the power consumed by the DDA algorithm, the 
diffusion term and the specular term, respectively, and p is the number of pixels 
in a triangle. 

The first term Pdda of Eq. 2 is the power consumed to compute the inner 
pixels of a triangle given the three pixel values computed by the second term. 
Although Pdda of a triangle varies with its position and shape, the variation can 
be ignored by assuming that all triangles are equilateral. 

Therefore, Pdda is: 

$P_{dda} = (x + 1)(P_{div} + P_{sub}) + [p - (3x - 3)] P_{add}$ 

$\phantom{P_{dda}} = 5(x + 1) + (p - 3x + 3) = p + 2x + 8 \qquad (4)$ 

where x and p are the number of pixels along a side of a triangle and the 
number of pixels within a triangle, respectively. The relationship between x and 
p can be expressed by 0.43x² + 0.13x − 0.56 = p, which is derived from simple 
equilateral triangle trigonometry. We also assume that Psub = Padd. Unlike the 
Pdda of Pgouraud, the Pdda term of Eq. 3 is the power consumed to interpolate 
the normal vector N. Since N has three components (x, y, z), Pdda is multiplied 
by 3. 

From Eq. 1, the diffusion term needs one vector dot product, one 
multiplication for the diffusion coefficient, and one addition for the ambient term. 
Hence Pdiff = 3 × Pmul + 2 × Padd + 1 × Pmul + Padd = 4 × Pmul + 3 × Padd = 9. 
With assumptions 3 and 4, Pspec is equal to Pdiff − 1 = 8. By substituting 
these numbers into Eq. 2 and Eq. 3, the ratio of Pphong to Pgouraud consumed 
to render a triangle with p pixels is expressed in Eq. 5. Fig. 3 shows the power 
consumption ratio for different power consumption ratios of adder and 
multiplier. 



$\frac{P_{phong}}{P_{gouraud}} = \frac{3(p + 2x + 8) + (9 + 8)p}{(p + 2x + 8) + (9 \times 3)} = \frac{20p + 6x + 24}{p + 2x + 35} \qquad (5)$ 

where p > 3. 
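
The model of Eqs. 2 to 5 can be checked numerically with a short sketch such as the one below. It is only an illustration under the assumptions of this section; the function names are hypothetical, and the side length x is obtained by solving the quadratic x-p relationship for its positive root, which is our reading of how x follows from p. 

```python
import math

# Normalized unit powers from assumption 5 (adder, multiplier, divider).
P_ADD, P_MUL, P_DIV = 1.0, 1.5, 4.0
P_SUB = P_ADD  # the text assumes Psub = Padd


def side_from_pixels(p):
    """Positive root x of 0.43*x**2 + 0.13*x - 0.56 = p."""
    a, b, c = 0.43, 0.13, -(0.56 + p)
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


def p_dda(p, x):
    """Eq. 4: DDA power for a triangle with p pixels and side length x."""
    return (x + 1) * (P_DIV + P_SUB) + (p - (3 * x - 3)) * P_ADD


def phong_gouraud_ratio(p):
    """Eq. 5: Phong-to-Gouraud power ratio for one triangle of p pixels."""
    x = side_from_pixels(p)
    p_diff = 4 * P_MUL + 3 * P_ADD      # = 9
    p_spec = p_diff - 1                 # = 8 under assumptions 3 and 4
    gouraud = p_dda(p, x) + 3 * p_diff                    # Eq. 2
    phong = 3 * p_dda(p, x) + (p_diff + p_spec) * p       # Eq. 3
    return phong / gouraud


if __name__ == "__main__":
    for pixels in (5.4, 23.5, 51.8):
        print(pixels, round(phong_gouraud_ratio(pixels), 2))
```

The ratio grows with triangle size because the per-pixel Phong terms dominate once p is large, while the Gouraud cost is dominated by the fixed per-vertex term. 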



Fig. 3. Power consumption ratio of Phong shading and Gouraud shading for one 
triangle (x-axis: triangle size in pixels/triangle) 



3.2 Image Quality Measurement 

In general, 3D graphics image quality is measured by human observation. For 
example, the level decision criteria of an LOD algorithm are pre-decided by trial 
and error, even though a method based on the human visual system has been 
introduced. In order to create criteria for shading selection, this paper borrows 
the peak signal to noise ratio (PSNR) equation from video image comparison 
methods. PSNR cannot be used alone to decide the relative quality of a given 
graphics image, since the same PSNR for different images does not mean that 
those images have the same quality. For this reason, human observation is used 
in addition to PSNR to compare Phong shaded and Gouraud shaded images. 
PSNR is shown in Eq. 6. 

$PSNR = 10 \log_{10} \frac{255^2}{\frac{1}{m} \sum_{i=1}^{m} (R_i - C_i)^2} \qquad (6)$ 

where m is the number of pixels in an image frame, and R and C are the 
reference image and the compared image, respectively. Fig. 4 shows stills of the 
two objects used for the simulation. Although they are just monochrome stills, 
the shading artifacts can be easily discerned. 
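
As a minimal sketch of the PSNR measure, assuming 8-bit pixel values stored in flat sequences (the function name and the data layout are illustrative choices, not taken from the paper): 

```python
import math


def psnr(reference, compared, peak=255):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(compared) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    m = len(reference)
    # Mean squared error over all m pixels.
    mse = sum((r - c) ** 2 for r, c in zip(reference, compared)) / m
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    phong_frame = [120, 130, 140, 150]     # made-up pixel values
    gouraud_frame = [118, 131, 143, 150]   # made-up pixel values
    print(psnr(phong_frame, gouraud_frame))
```

Here the Phong shaded frame would play the role of the reference image R and the Gouraud shaded frame the role of the compared image C. 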



4 Content Adaptive 3D Graphics Rendering 

One of many techniques to reduce rendering computation is layered image 
rendering [39]. With layered image rendering, the refresh rate of a rendered 
object is reduced. If a rendering system is optimized for object refresh rate, two 
content variation cases, scene change and object motion, require a refresh of the 
particular objects affected by those variations. The approaches proposed in this 
paper need the help of the software-side application step in Fig. 1. Information 
detection (scene change occurrence, the distance between camera and object, and 
the speed of a moving object) and adaptive shader control should be done by the 
application step, since this information can be detected much more easily there 
than in other steps. Note that this information is readily available in 3D 
graphics, while it must be derived in video sequences. 

Fig. 4. Simulation Model: (a) Fighter with 3843 triangles and 2014 vertices (b) Teapot 
with 2257 triangles and 1180 vertices 



4.1 Distributed Computation over Frames 

In the layered image rendering system, a scene change is one of the most 
demanding cases, since all objects of all layers must be rendered in one frame 
time. If a scene change requires the most rendering computation overall, the 
rendering system clock has to be chosen for this case. The afterimage 
phenomenon of the human visual system can be used to relax the system clock 
requirement: new objects are rendered over several frame times with an 
accumulation buffer, without severely noticeable artifacts from the progressive 
rendering. From Eq. 1, the diffusion and specular terms are summed over the 
number of light sources. If there are four light sources, the system clock must be 
fast enough to iterate the diffusion and specular terms four times in a frame time. 
As shown in Fig. 5, by adding an accumulation buffer and executing the diffusion 
and specular term iterations over four frame times, the system clock can be 
lowered either to 1/4 of the clock speed or to the speed of another step of the 
rendering pipeline, whichever becomes the bottleneck. 

If we have local memory to cache the normal vector for each pixel, no 
additional computation is needed. Thus the power saving obtained from this 
approach could be up to 75% − Pmem for the above example, where Pmem is the 
power consumed by memory access and maintenance. The ratio of power saving 
varies according to the contents of the images. The hardware cost is 1.85MB of 
memory for normal vector caching and 922KB for the accumulation buffer, for a 
rendering system with a 640 x 480 image frame size, 16-bit normal vectors and 
support for the hidden surface removal algorithm. Since the hiding of artifacts 
resulting from progressive rendering over frames can only be demonstrated by 
displaying the image sequence, the results are not shown in this paper. 
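
The buffer sizes and the upper bound on the saving can be reproduced with a rough calculation. The sketch below assumes three 16-bit components per cached normal vector and three 8-bit color components per accumulation buffer pixel; the text states only the 16-bit normal vector, so the accumulation buffer format is our assumption, and the function names are hypothetical. 

```python
def buffer_bytes(width, height, components, bits_per_component):
    """Size in bytes of a per-pixel buffer."""
    return width * height * components * bits_per_component // 8


def max_power_saving(light_sources, p_mem=0.0):
    """Upper bound on the saving when the per-light iterations are spread
    over `light_sources` frame times; p_mem is the memory power, expressed
    as a fraction of the original power."""
    return 1.0 - 1.0 / light_sources - p_mem


# 640 x 480 frame, normal vector with three 16-bit components.
normal_cache = buffer_bytes(640, 480, 3, 16)
# Assumed accumulation buffer format: three 8-bit color components.
accumulation = buffer_bytes(640, 480, 3, 8)

if __name__ == "__main__":
    print(normal_cache, accumulation, max_power_saving(4))
```

Under these assumptions the normal cache comes to 1,843,200 bytes and the accumulation buffer to 921,600 bytes, in line with the 1.85MB and 922KB figures quoted above, and four light sources give the 75% bound before the memory term is subtracted. 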



4.2 Adaptive Shading on Moving Object 

The two different shader configurations shown in Fig. 2 are used to shade an 
object according to the criteria explained below. The speed of an object in 
motion and its distance from the camera are used to choose between Phong and 
Gouraud shading. Since different 3D models differ in light reflection, surface 
complexity, color, etc., each 3D model needs its own shading decision rule. To 
derive the decision rule for a 3D model, a PSNR graph over various speeds and 
distances is first generated for that model. Fig. 6 shows the PSNR graphs of the 
Fighter and Teapot 3D models of Fig. 4. Next, using the PSNR graph and visual 
observation on the screen, an appropriate PSNR value is found and the decision 
line is drawn along the chosen PSNR. In Fig. 6 (a), PSNR 33 is chosen for the 
Fighter model. Although the PSNR 33 line has a contour shape, a straight line is 
used in order to keep the controller simple. In a scene, if the Fighter model’s 
speed and distance fall in the shaded area of Fig. 6, the controller starts to use 
Gouraud shading. 

Fig. 5. Phong Shader with Accumulation Buffer 
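
A straight-line decision rule of this kind could be sketched as follows. The line coefficients, the orientation of the shaded area (assumed here to lie on or above the line in the speed-distance plane) and all names are hypothetical; in practice they would be read off the PSNR graph of each model. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionLine:
    """Straight-line approximation of a PSNR contour in the
    (speed, distance) plane: distance = slope * speed + intercept."""
    slope: float      # hypothetical value, fitted per 3D model
    intercept: float  # hypothetical value, fitted per 3D model

    def in_shaded_area(self, speed, distance):
        # Assumption: the shaded (Gouraud) area is on or above the line.
        return distance >= self.slope * speed + self.intercept


def select_shader(line, speed, distance):
    """Return the shader configuration the controller would pick."""
    return "gouraud" if line.in_shaded_area(speed, distance) else "phong"


if __name__ == "__main__":
    fighter_line = DecisionLine(slope=-0.5, intercept=10.0)  # made-up numbers
    print(select_shader(fighter_line, speed=20.0, distance=5.0))
    print(select_shader(fighter_line, speed=2.0, distance=1.0))
```

Keeping the rule to one comparison per object matches the stated goal of a simple controller; shifting the intercept enlarges or shrinks the shaded area. 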





Fig. 6. PSNR of Adaptive Shading on Moving Objects: (a) Fighter and (b) Teapot 

Fig. 7. Blurred Fighter Image (21 pixels/frame): (a) Phong shading (b) Gouraud 
shading 



In Fig. 7, the boxed area shows the difference between Phong and Gouraud 
shading. The difference is visible even though both images are motion blurred, 
because they are still frames. When the image sequence generated by adaptive 
shading is played in real time, however, the difference is hard to notice given 
an appropriate decision rule. The decision rule can be used as a control parameter 
for power saving: moving the decision line in Fig. 6 to enlarge the shaded area 
gives an object more chance to fall in the shaded area. 



Table 1. Average power saving for 5 different distances 

Distance         Avg. Pixel/Tri   Fighter   Teapot   Avg. Pixel/Tri 
1                23.5             7.35      10.36    51.8 
2                18.8             6.56      8.52     32.1 
3                15.8             5.99      7.10     21.9 
4                11.2             4.94      5.06     11.7 
5                6.7              3.66      3.23     5.4 
Avg.                              5.7       6.85 
Power saving %                    82.5%     85.4% 



Table 1 shows how much power can be saved when rendering the Fighter and 
Teapot 3D models in fast motion. The Teapot model saves more power than the 
Fighter model, since the Fighter model has fewer pixels per triangle than the 
Teapot model. In subsection 3.1, the specular reflection parameter s is assumed 
to be 1. This means that the power saving percentage could be higher than the 
numbers in Table 1, because s is 20 for the Fighter model and 40 for the Teapot 
model. 
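
The last two rows of Table 1 are consistent with reading the Fighter and Teapot columns as Phong-to-Gouraud power ratios and the saving as one minus the inverse of the average ratio. That reading is our inference from the numbers rather than something the text states; the sketch below, with hypothetical names, reproduces the rows under this assumption. 

```python
# Phong/Gouraud power ratios per distance, copied from Table 1.
FIGHTER_RATIOS = [7.35, 6.56, 5.99, 4.94, 3.66]
TEAPOT_RATIOS = [10.36, 8.52, 7.10, 5.06, 3.23]


def average(values):
    return sum(values) / len(values)


def power_saving_percent(ratios):
    """Saving from switching Phong to Gouraud, assuming the saving is
    1 - 1/(average Phong-to-Gouraud ratio)."""
    return 100.0 * (1.0 - 1.0 / average(ratios))


if __name__ == "__main__":
    for name, ratios in (("Fighter", FIGHTER_RATIOS), ("Teapot", TEAPOT_RATIOS)):
        print(name, round(average(ratios), 2), round(power_saving_percent(ratios), 1))
```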

4.3 Adaptive Specular Term Computation 



Due to the high cost of the exponential computation in the specular term of 
Eq. 1, the approach proposed in this section introduces a reconfigurable 
exponential computing function block in Fig. 2. Because exponential computation 
can be implemented in many ways, we take as the reference a general 
implementation whose computational cost grows at least logarithmically in 
s [9]. 

One simple way to mimic original Phong shading is Eq. 7 from [34]. Since this 
equation requires one multiplication, one subtraction, one addition and one 
division, it is not power efficient for all values of the specular reflection 
parameter s. For small values of s, simply using iterative multiplication 
consumes less power. 



$\mathrm{Specular\ term} = k_s \, \frac{N \cdot H}{(N \cdot H) - (N \cdot H) n_s + n_s} \qquad (7)$ 



If the reference exponential computation block requires 8 multiplications 
when s = 256, its power consumption is 8 × 1.5 = 12, while the power 
consumption of the fast Phong equation is 1 + 1 + 1.5 + 4 = 7.5. These numbers 
yield a 37.5% power saving. Fig. 8 shows the difference between Phong and 
Fast_Phong with s = 64, which yields a 16.7% power saving. 
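
The two savings quoted above follow from the normalized unit powers of subsection 3.1. The sketch below assumes that the reference block uses repeated squaring, so that s = 2^k costs k multiplications (this matches the 8 multiplications stated for s = 256 but is otherwise our assumption), and it uses the rational approximation of [34] in the form t / (n − n·t + t). All function names are illustrative. 

```python
# Normalized unit powers (adder, multiplier, divider); Psub = Padd.
P_ADD, P_MUL, P_DIV = 1.0, 1.5, 4.0
P_SUB = P_ADD

# One multiplication, one subtraction, one addition, one division.
FAST_PHONG_POWER = P_MUL + P_SUB + P_ADD + P_DIV  # = 7.5


def reference_power(s):
    """Power of t**s by repeated squaring, for s a power of two >= 2."""
    if s < 2 or s & (s - 1):
        raise ValueError("s must be a power of two >= 2 in this sketch")
    multiplications = s.bit_length() - 1  # log2(s)
    return multiplications * P_MUL


def fast_phong_saving(s):
    """Relative power saving of the fast equation over the reference."""
    ref = reference_power(s)
    return (ref - FAST_PHONG_POWER) / ref


def fast_phong(t, n):
    """Rational approximation of t**n for t = N.H in [0, 1]."""
    return t / (n - n * t + t)


if __name__ == "__main__":
    for s in (16, 64, 256):
        print(s, round(100 * fast_phong_saving(s), 1))
```

For s = 16 the saving is negative, which illustrates the remark that iterative multiplication is cheaper for small s. 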




Fig. 8. Blurred Teapot Image (21 pixels/frame): (a) Phong shading (b) Fast Phong 
shading 



Hence the adaptive specular term computation uses the same criteria plus 
one more input parameter, s, for making the decision rule. Fig. 9 shows the PSNR 
graph used to draw the decision line. Unlike in the graph for adaptive shading, 
Fast_Phong can be used for all objects in motion regardless of speed and 
depth, since the PSNRs are much higher than those of adaptive shading. 

Fig. 9. PSNR of Adaptive Specular Term Computation: (a) Fighter and (b) Teapot 



5 Conclusion 

In this paper, we showed that graphics image content variation and the visual 
perception of the user can be used as important parameters of reconfigurable 
algorithms to reduce the power consumption of a 3D graphics rendering system. 
Although the numbers shown in this paper are not detailed, due to the lack of 
an implementation, it is shown that the proposed adaptive shading could 
reduce the power consumption dramatically. The cost of system reconfiguration 
is not discussed in this paper. If, however, we assume that an interactive 
3D graphics image sequence has content variation characteristics similar to 
those of video, the low reconfiguration rate would have only a small impact on 
the power saving rate. 

Future work will exploit more content variation and visual perception in 
other steps of the rendering pipeline in order to find further ways to reduce 
power consumption, and will develop efficient dynamically reconfigurable 
hardware to realize the adaptive algorithms. 

References 

1. A. Abnous and J. Rabaey. Ultra-Low-Power domain-Specific multimedia processor. 
In IEEE Workshop on Signal Processing Systems, October 1996. 

2. J. Blinn. Simulation of wrinkled surfaces. In SIGGRAPH ’78, pages 286-292, 1978. 

3. C. Chen and C. Lee. A cost effective lighting processor for 3d graphics application. 
In IEEE International Conference on Image Processing, pages 792-796, October 
1999. 

4. Altera Co. http://www.altera.com/. 

5. Atmel Co. http://www.atmel.com/. 

6. Intergraph Co. http://www.integraph.com/. 

7. J. Cohen, M. Olano, and D. Manocha. Appearance-preserving simplification. In 
SIGGRAPH ’98, July 1998. 

8. VRML Consortium, http://www.vrml.org/. 

9. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw-Hill, 
1990. 

10. A. DeHon. Reconfigurable architecture for general-purpose computing. Technical 
Report 1586, MIT, A.I. Lab, 1996. 

11. Digital Domain, http://www.d2.com/. 

12. C. Ebeling, D.C. Cronquist, and P. Franklin. RaPiD - reconfigurable pipelined 
datapath. In Field-Programmable Logic: Smart Applications, New Paradigms and 
Compilers, pages 126-135, September 1996. 






J. Euh and W. Burleson 



13. J. Foley, A. van Dam, S. Feiner, and J. Hughes. Computer Graphics: Principles 
and Practice. Addison-Wesley Publishing Company, second edition, 1990. 

14. H. Gouraud. Continuous shading of curved surfaces. IEEE Transactions on 
Computers, pages 623-629, June 1971. 

15. A. Gueziec, G. Taubin, F. Lazarus, and W. Horn. Simplicial maps for progressive 
transmission of polygonal surfaces. In VRML ’98, February 1998. 

16. P. Haeberli and K. Akeley. Hardware support for high-quality rendering. ACM 
Computer Graphics, pages 309-318, August 1990. 

17. R. Hartenstein and R. Kress. A datapath synthesis system for the reconfigurable 
datapath architecture. In Asia and South Pacific Design Automation Conference, 
August 1995. 

18. P. Heckbert and M. Garland. Multiresolution modeling for fast rendering. In 
Graphics Interface ’94, May 1994. 

19. H. Hoppe. Progressive meshes. In SIGGRAPH ’96, July 1998. 

20. ATI Technology Inc. http://www.atitech.ca/. 

21. 3dfx Interactive, Inc. http://www.3dfx.com/3dfxtechnology/tbufferwp.pdf. 

22. 3Dlabs Inc. http://www.3Dlabs.com/. 

23. Diamond Multimedia Systems Inc. http://www.diamondmm.com/. 

24. Rendition Inc. http://www.rendition.com/. 

25. S3 Inc. http://www.s3.com/. 

26. Xilinx, Inc. http://www.xilinx.com/. 

27. SIGGRAPH/Eurographics joint Workshop on Graphics Hardware, 
http://hwws.org/. 

28. T. McReynolds, T. Blythe, B. Grantham, and S. Nelson. Programming with 
opengl: Advanced techniques. In SIGGRAPH ’98, July 1998. 

29. T. Moller and E. Haines. Real-Time Rendering. AK Peters, 1999. 

30. A. Nannarelli. Low Power Division and Square Root. Ph.D Dissertation, U.C. 
Irvine, 1999. 

31. R. Pajarola and J. Rossignac. Compressed progressive meshes. IEEE Transactions 
on Visualization and Computer Graphics, 6(1):79-93, 2000. 

32. S. Pattanaik, J. Tumblin, H. Yee, and D. Greenberg. Time-dependent visual adap- 
tation for fast realistic display. In SIGGRAPH ’00, July 2000. 

33. B. Phong. Illumination for computer generated pictures. Communications of the 
ACM (CACM), pages 311-317, June 1975. 

34. Christophe Schlick. Graphics Gems IV. AP Professional, 1994. 

35. B. Shih, Y. Yeh, and C. Lee. An area efficient architecture for 3d graphics shading. 
In IEEE International Conference on Consumer Electronics, pages 462-463, June 
1998. 

36. H. Shin, J. Lee, and L. Kim. A minimized hardware architecture of fast phong 
shader using taylor series approximation in 3d graphics. In IEEE International 
Conference on Computer Design, pages 286-291, October 1998. 

37. J. Snyder and J. Lengyel. Visibility sorting and compositing without splitting for 
image layer decomposition. In SIGGRAPH ’98, July 1998. 

38. R. Tessier and W. Burleson. Reconfigurable computing for digital processing: A 
survey. VLSI Signal Processing, September 2000. 

39. J. Torborg and J. Kajiya. Talisman: Commodity realtime 3d graphics for the pc. 
In SIGGRAPH ’96, 1996. 

40. M. Wloka and R. Zeleznik. Interactive real-time motion blur. In CGI’94: Insight 
Through Computer Graphics, 1995. 



Compiler-Directed Dynamic Frequency and 
Voltage Scheduling* 



Chung-Hsing Hsu¹, Ulrich Kremer¹, and Michael Hsiao² 

¹ Department of Computer Science, Rutgers University 
Piscataway, New Jersey, USA 
{chunghsu, uli}@cs.rutgers.edu 

² Department of Electrical and Computer Engineering, Rutgers University 
Piscataway, New Jersey, USA 
mhsiao@ece.rutgers.edu 



Abstract. Dynamic voltage and frequency scaling has been identified 
as one of the most effective ways to reduce power dissipation. This paper 
discusses a compilation strategy that identifies opportunities for dynamic 
voltage and frequency scaling of the CPU without significant increase 
in overall program execution time. The paper introduces a simple, yet 
effective performance model to determine an efficient CPU slow-down 
factor for memory bound loop computations. Simulation results of a 
superscalar target architecture and a program kernel compiled at different 
optimization levels show the potential benefit of the proposed compiler 
optimization. The energy savings are reported for a hypothetical target 
machine with power dissipation characteristics similar to Transmeta’s 
Crusoe TM5400 processor. 



1 Introduction 

Modern architectures have a large gap between the speeds of the memory and the 
processor. Several techniques exist to bridge this gap, including memory pipelines 
(outstanding reads/writes), cache hierarchies, and large register sets. Most of 
these architectural features exploit the fact that computations have temporal 
and/or spatial locality. However, many computations have limited locality, or 
even no locality at all. In addition, the degree of locality may be different for 
different program regions. Such computations may lead to a significant mismatch 
between the actual machine balance and computation balance, typically resulting 
in long stalls of the processor waiting for the memory subsystem to provide the 
data. 

We will discuss the benefits of compile-time voltage and frequency scaling 
for single loop nests. The compiler not only generates code for the input loop, 
but also assigns a clock frequency and voltage level for its execution. The goal of 
this new compilation technique is to provide close to the same overall execution 
time while significantly reducing the power dissipation of the processor and/or 
memory. The basic idea behind the compilation strategy is to slow down the 
CPU that otherwise would stall or be idle. Frequency reduction and voltage 
reduction may lead to a linear and quadratic decrease of power consumption, 
respectively. In addition, recent work by Martin et al. has shown that reducing 
peak power consumption can substantially prolong battery life [24]. 

* This research was partially supported by NSF CAREER award CCR-9985050 and 
a Rutgers University ISC Pilot Project grant. 

1.1 The Cost Model 

The dominant source of power consumption in digital CMOS circuits is the 
dynamic power dissipation (P), characterized by 

$P \propto C V^2 f$ 

where C is the effective switching capacitance, V is the supply voltage, and f 
is the clock speed [6]. Since power varies linearly with the clock speed and with 
the square of the voltage, adjusting both can produce cubic power reductions, at 
least in theory. However, reducing the supply voltage requires a corresponding 
decrease in clock speed. The maximum clock speed for a supply voltage can be 
estimated as 

$f_{max} \propto \frac{(V - V_T)^{\alpha}}{V}$ 

where V_T is the threshold voltage (0 < V_T < V), and α is a technology 
dependent factor (1 < α < 2). Despite the non-linearity between clock speed and 
supply voltage, scaling both supply voltage and clock speed will produce at 
least quadratic power savings, and as a result quadratic energy savings. Figure 1 
gives the relation between clock speed, supply voltage, and power dissipation for 
Transmeta’s Crusoe TM5400 microprocessor as reported in its data sheet [33]. 
For a program running for a period of T seconds, its total energy consumption 
(E) is approximately equal to 

E = Pavg * T 

where Pavg is the average power consumption. 
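
The proportionalities above can be turned into a small calculator. This is only a sketch of the idealized model: the function names, the reference operating point and the proportionality constant k are hypothetical, and real parts such as the TM5400 follow their data sheet rather than these formulas. 

```python
def relative_power(v, f, v_ref, f_ref):
    """Dynamic power relative to a reference point, from P ~ C * V^2 * f
    with the switching capacitance C held constant."""
    return (v / v_ref) ** 2 * (f / f_ref)


def max_clock(v, v_t, alpha, k=1.0):
    """Estimated maximum clock speed at supply voltage v, up to the
    (hypothetical) technology constant k."""
    if not 0.0 < v_t < v:
        raise ValueError("requires 0 < v_t < v")
    return k * (v - v_t) ** alpha / v


def energy(p_avg, seconds):
    """E = Pavg * T."""
    return p_avg * seconds


if __name__ == "__main__":
    # Halving both voltage and frequency: cubic reduction in theory.
    print(relative_power(0.5, 0.5, 1.0, 1.0))
    # Running twice as long at one eighth of the power uses a quarter
    # of the energy, i.e. a quadratic energy saving.
    print(energy(relative_power(0.5, 0.5, 1.0, 1.0), 2.0) / energy(1.0, 1.0))
```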

Supply voltage (V):  0.95  1.10  1.25  1.33  1.40  1.50  1.60  1.65 
Power (%):  3.0%  4.7%  12.7%  24.6%  32.5%  41.1%  59.0%  80.6%  100% 

Fig. 1. The relation between clock frequency, supply voltage, and power dissipation 
of Transmeta’s Crusoe TM5400 microprocessor. The voltage figures for frequencies 
100MHz and 70MHz are interpolations and are not supported by the chip. 



1.2 Voltage Scheduling 

In the context of dynamic voltage scaled microprocessors, voltage scheduling is 
the problem of assigning appropriate clock speeds to a set of tasks, and adjusting 
the voltage accordingly, such that no task misses its predefined deadline while 
the total energy consumed is minimized. Researchers have proposed many ways of 
determining "appropriate" clock speeds through on-line and off-line algorithms 
[34,14,13,16,28]. The basic idea behind these approaches is to slow down the 
tasks as much as possible without violating the deadline. 

This "just-in-time" strategy can be illustrated through a voltage scheduling 
graph [27]. In a voltage scheduling graph, the X-axis represents time and the 
Y-axis represents processor speed. The total amount of work for a task is defined 
by the area of the task "box". For example, task 1 in Figure 2 has a total workload 
of 8,000 cycles. By "stretching" it out all the way to the deadline without changing 
the area, we are able to decrease the CPU speed from 600MHz down to 400MHz. 
As a result, 23.4% of total (CPU) energy may be saved on a Crusoe TM5400 
processor. 



[Figure: three voltage scheduling graphs for task 1, each plotting speed [MHz] 
against time [sec]: (a) original schedule, (b) voltage scaled schedule, 
(c) energy-performance tradeoffs.] 

Fig. 2. The essence of voltage scheduling. 



In the case of a soft deadline, energy can be saved by trading off program 
performance for power savings. For example, if the voltage schedule starts with a 
300MHz clock speed for 8 seconds and then switches to 400MHz for 14 seconds, 
the resulting execution time for task 1 is 22 seconds. The schedule incurs a 10% 
performance penalty (with respect to the deadline), but it saves 29.1% of total 
energy as compared to the 600MHz case. These estimates assume that there is 
no performance penalty for the frequency and voltage switching itself. 
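
The bookkeeping behind such schedules can be sketched as below. A schedule is a list of (speed in MHz, duration in seconds) segments, work is the area under the speed curve in the units of Fig. 2, and energy is the sum of power times duration. The power function passed in is a placeholder: the example uses the idealized cubic model of Section 1.1, not the TM5400 data sheet figures, so the saving it prints is larger than the percentages quoted in the text. All names are hypothetical. 

```python
def total_work(segments):
    """Area under the speed curve (MHz x seconds)."""
    return sum(speed * seconds for speed, seconds in segments)


def execution_time(segments):
    return sum(seconds for _, seconds in segments)


def relative_energy(segments, power_of):
    """Sum of P(speed) * duration for a caller-supplied power model."""
    return sum(power_of(speed) * seconds for speed, seconds in segments)


def ideal_power(speed_mhz, f_max=600.0):
    """Placeholder model: power falls cubically with clock speed."""
    return (speed_mhz / f_max) ** 3


deadline = 20.0
baseline = [(600, 8000 / 600)]       # original schedule at full speed
stretched = [(400, 20)]              # just-in-time schedule
soft = [(300, 8), (400, 14)]         # soft-deadline schedule from the text

penalty = execution_time(soft) / deadline - 1.0
saving = 1.0 - relative_energy(soft, ideal_power) / relative_energy(baseline, ideal_power)

if __name__ == "__main__":
    print(total_work(soft), execution_time(soft), round(penalty, 2), round(saving, 3))
```

Both the stretched and the soft-deadline schedule cover the same 8,000 units of work; only the soft-deadline one runs past the 20 second deadline, by 10%. 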

1.3 Our Contributions 

We propose a simple compile-time model to identify and estimate the maximum 
possible energy savings of dynamic voltage and frequency scaling under the con- 
straint that overall program execution times may only be slightly increased, or 
not increased at all. In many cases, a compiler is able to predict and shape the 
future behavior of a program and the interaction between large program segments, 
giving compilers an advantage over operating system techniques. Typically, 
operating system techniques rely on the observed past program behavior within 
a restricted time window to predict future behavior. Preliminary simulation 
results show the effectiveness of our model for optimized and unoptimized loops. 
The impact of various compiler optimizations on energy savings is discussed. In 
summary, we propose a simple model and new compilation strategy for dynamic 
voltage and frequency scaling. 

The rest of the paper is organized as follows: Section 2 presents the simple 
model. Using a single simple benchmark, the effect of various compiler optimiza- 
tions is illustrated in Section 3. Section 5 gives a brief summary of related work, 
and Section 6 concludes the paper. 



2 Compiler-Directed Frequency Scaling 

Consider the simple C program kernel in Figure 3(a). The loop scans a 
two-dimensional array in column-major order, and has no temporal locality (i.e., 
each array element is referenced only once). Array size n is carefully chosen so 
that no spatial locality is present across iterations of the outermost j-loop. The 
loop will have spatial locality (i.e., successively accessed array elements reside in 
the same cache block) only if the array is scanned in row-major order. 

Suppose the program is executed on a hypothetical superscalar machine with 
out-of-order execution, non-blocking loads/stores, and a multi-level memory 
hierarchy.¹ 

The graphs shown in Figures 3(b) and (c) illustrate the opportunities for 
dynamic frequency scaling for our program kernel. The unoptimized version is 
heavily memory-bound, allowing a potential processor slow-down of up to a 
factor of 20 without a significant performance penalty. Figure 3(b) shows several 
scaled clock speeds whose relative performance is very close to 100%. These 
scaled speeds are 1/2, 1/5, 1/10, and 1/20 and result in performance penalties 
of less than 1%. If we are able to identify these scaled speeds, CPU energy 
consumption can be reduced by more than one order of magnitude for our target 
architecture, assuming it has energy characteristics similar to those of a Crusoe 
TM5400 processor. 

Using advanced optimizations such as loop interchange, loop unrolling, and 
software prefetching, the computation/memory balance of the example loop can 
be significantly improved. Given that our target architecture has an L2 block size 
of 64 bytes and a single bank memory with a latency of 100 cycles, Figure 3(c) 
shows the best performance possible for the code, i.e., the performance is only 
limited by the physical memory bandwidth of the architecture.² Even for this 
best case scenario, there is still significant opportunity for voltage and frequency 
scaling with a performance penalty of less than 1%. 

Both examples show that choosing the right slow-down factor is crucial to 
achieve energy savings with minimal performance penalties. In fact, increasing 
the slow-down factor may actually result in overall performance improvements, 
a somewhat non-intuitive result. This behavior can be attributed to 
synchronization effects between memory and CPU. 

    float A[n][n], accu; 

    for (j=0; j<n; j++) 
      for (i=0; i<n; i++) 
        accu += A[i][j]; 

(a) A simple loop. (b) No loop optimizations, and no data locality. (c) Highly 
optimized: loop interchange & unrolling, software prefetching. [Plots (b) and (c): 
x-axis slow-down factor.] 

Fig. 3. A simple C program and its unoptimized and optimized performance under 
different CPU clock frequencies. The horizontal lines indicate the threshold for a 1% 
performance degradation. 

¹ More details regarding our target machine can be found in Section 4. 

² More details regarding the performed optimizations can be found in Section 3. 

2.1 A Simple Model 

We divide the total program execution time (T) into three portions:

$$T = \mathit{cpuBusy} + \mathit{memBusy} + \mathit{bothBusy}$$

with the right-hand side entities defined as follows:

- the CPU is busy while the memory is idle (cpuBusy); this includes CPU
  pipeline stalls due to hazards,

- the memory is busy and the CPU is stalled while waiting for data from
  memory (memBusy), and

- CPU and memory are both active at the same time, i.e., are working in
  parallel (bothBusy).

Now consider the case where the CPU speed is reduced by a factor of $\delta$.
Assume that the program at the reduced clock speed behaves exactly the same
for every program step as the program at normal speed, only executed in
"slow motion".^ The new, slowed-down execution time becomes

$$T_{new}(\delta) = \delta \cdot \mathit{cpuBusy} + \max(\mathit{memBusy} + \mathit{bothBusy},\; \delta \cdot \mathit{bothBusy})$$

In order to have the new execution time very close to the original one, for
instance $T_{new}(\delta)/T < 101\%$, $\delta \cdot \mathit{cpuBusy}$ needs to
be very close to cpuBusy, and also $\delta \cdot \mathit{bothBusy}$ cannot be
too large. Based on the model, we propose two conditions

$$(\delta - 1) \cdot \mathit{cpuBusy} < 1\% \qquad (1)$$

$$\delta \le 1 + \frac{\mathit{memBusy}}{\mathit{bothBusy}} \qquad (2)$$

such that once they are satisfied, the new execution time is within 1% of the
original execution time.

^ This may not be the case in practice, for instance due to out-of-order
instruction execution.

Condition (1) indicates that cpuBusy can be used as a measure of performance
penalty. When it is relatively large, the clock speed cannot be slowed
down without hurting the performance significantly. In addition, Condition (2)
states that the slow-down factor cannot be arbitrarily large. How much the CPU
can be slowed down is highly dependent on how much it is stalling due to
memory requests. The more the CPU stalls, the slower we can set the CPU speed.
For the unoptimized loop in Figure 3(b), simulation produces the following
figures:



    cpuBusy    memBusy    bothBusy
    0.01%      93.99%     6.00%



From Conditions (1) and (2), it can be derived that
$\delta < 1 + 1\%/\mathit{cpuBusy} = 1 + 1\%/0.01\% = 101$ and
$\delta \le 1 + 93.99\%/6.00\% = 16.67$, respectively. Since both conditions
need to be satisfied, the maximum slow-down factor suggested by the simple
model is 17. In other words, the clock speed can be reduced to as little as
1/17 of the original speed with no more than a 1% performance penalty.
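The two bounds can be recomputed mechanically. The following is a minimal sketch, not part of the original model description: the function names are illustrative, and all workload values are passed as fractions of the original execution time T.

```c
#include <assert.h>

/* Condition (1): (delta - 1) * cpuBusy < penalty
 * gives delta < 1 + penalty / cpuBusy. */
double bound_from_cpu_busy(double cpu_busy, double penalty)
{
    assert(cpu_busy > 0.0);
    return 1.0 + penalty / cpu_busy;
}

/* Condition (2): delta <= 1 + memBusy / bothBusy. */
double bound_from_overlap(double mem_busy, double both_busy)
{
    assert(both_busy > 0.0);
    return 1.0 + mem_busy / both_busy;
}

/* Both conditions must hold, so the tighter bound wins. */
double max_slowdown(double cpu_busy, double mem_busy,
                    double both_busy, double penalty)
{
    double b1 = bound_from_cpu_busy(cpu_busy, penalty);
    double b2 = bound_from_overlap(mem_busy, both_busy);
    return b1 < b2 ? b1 : b2;
}
```

For the unoptimized loop, `max_slowdown(0.0001, 0.9399, 0.0600, 0.01)` is limited by Condition (2) and evaluates to roughly 16.67, the value quoted in the text.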

However, as observed in Figure 3(b), not all CPU speed reductions down to
1/17 of the original speed keep the performance slow-down at 1% or less. For
example, execution time increases by 30.0% when the clock speed is set to 1/12
of the original speed. The reason for this significant performance decrease is
the mismatch of the memory and CPU cycle times, resulting in clock skew effects
during synchronization. Our model takes this effect into account by
introducing a third condition:

    memory latency is divisible by $\delta$    (3)

Finally, in order to simplify the analysis, we require a fourth condition:

    $\delta$ has an integral value    (4)

As a result, the model correctly identifies the speed reductions 1/2, 1/4,
1/5, and 1/10 that satisfy the deadline constraint. However, the speed
reduction to 1/20 is not suggested by our model. Possible reasons include the
imprecision of CPU and memory workload prediction and the "ideal-world"
assumption that program behavior remains the same under different clock
speeds. For the optimized code of Figure 3(c), our model selects the slow-down
factor $\delta = 2$.
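Conditions (3) and (4) turn the selection into a small enumeration. The sketch below is an illustration under stated assumptions: the function name and the caller-supplied output buffer are hypothetical, and `max_delta` is the bound already derived from Conditions (1) and (2).

```c
#include <assert.h>
#include <stddef.h>

/* Collects integral slow-down factors delta >= 2 (Condition (4)) that divide
 * the memory latency (Condition (3)) and do not exceed max_delta.
 * Returns the number of factors written to out (at most cap). */
size_t candidate_slowdowns(int mem_latency_cycles, double max_delta,
                           int *out, size_t cap)
{
    size_t count = 0;
    assert(mem_latency_cycles > 0 && out != NULL);
    for (int delta = 2;
         delta <= mem_latency_cycles && (double)delta <= max_delta;
         delta++) {
        if (mem_latency_cycles % delta == 0 && count < cap)
            out[count++] = delta;
    }
    return count;
}
```

With the 100-cycle memory latency of the target machine and a bound of 16.67, the candidates are 2, 4, 5, and 10; the factor 20 divides the latency but exceeds the bound, which matches the discussion above.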






(1) Identify program regions as scheduling candidates 

(2) Model expected performance 

(a) Determine cpuBusy, memBusy, and bothBusy 

(b) Compute slow-down factor $\delta$ using the model discussed in Section 2.1

(3) Generate voltage/frequency scheduling instructions for each scheduling
candidate; adjust performance optimizations, if necessary



Fig. 4. Outline of basic compilation approach 



2.2 Basic Compilation Strategy 

The basic compilation strategy is shown in Figure 4. The granularity of scheduling
candidates has to be large enough to compensate for the overhead of voltage 
and frequency adjustments. Each scheduling candidate will be assigned a single 
voltage and frequency, allowing dynamic changes of voltage and frequency only 
between scheduling candidates. Initially, we will consider loop nests as scheduling 
candidates that will be analyzed and assigned a frequency and voltage. Possible
ways to identify such candidate loop nests include the phase definition
introduced by Kennedy and Kremer in the context of automatic data layout [21].

Different strategies can be used to determine values for our model parameters
cpuBusy, memBusy, and bothBusy. Static compile-time analysis, online and
offline performance monitoring, or a combination of static and dynamic
techniques are currently under investigation. For this discussion, we assume
that the values of the three model parameters are available. The main focus of
this paper is the discussion of a model that is able to select a suitable
slow-down factor $\delta$ for a deadline constraint dg, given values for
cpuBusy, memBusy, and bothBusy.

The third and last compilation step will insert frequency and voltage adjust- 
ment instructions before each scheduling candidate, i.e., before each candidate 
loop nest. Assuming that the overhead of switching relative to the computa- 
tion within a single loop nest is so small that it can be ignored, the collection 
of solutions for individual loop nests will represent an optimal solution for the 
entire program. We are currently investigating scheduling candidates of a finer 
granularity where the switching overhead is significant. In this case, an optimal 
frequency and voltage assignment requires multiple scheduling candidates to be 
considered at the same time. 
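As a rough illustration of the third compilation step, the sketch below places a speed-setting call in front of a candidate loop nest. It is only a sketch under stated assumptions: `set_cpu_slowdown` is a hypothetical stand-in for whatever voltage/frequency adjustment instruction the target provides (here it merely records the request), and the fixed array size `N` is chosen for the example.

```c
#include <assert.h>

#define N 64

/* Hypothetical stand-in for a voltage/frequency adjustment instruction;
 * it only records the most recently requested slow-down factor. */
static int requested_slowdown = 1;

static void set_cpu_slowdown(int delta)
{
    assert(delta >= 1);
    requested_slowdown = delta;
}

/* Candidate loop nest (the column-major scan of Figure 3(a)) with the
 * adjustment inserted once, immediately before the nest. */
float sum_column_major(float A[N][N], int delta)
{
    float accu = 0.0f;
    set_cpu_slowdown(delta);
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            accu += A[i][j];
    return accu;
}
```

Because the adjustment is issued once per loop nest, its cost is amortized over the whole nest, which is the situation the paragraph above assumes.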

3 The Impact of Compiler Optimizations 

In the following, some of the advanced memory hierarchy optimizations will be 
used to demonstrate the impact of performance-oriented optimizations on the 
possibility of slowing down the CPU speed without noticeable penalty. Such 
optimizations can either reduce the number of memory references (e.g., loop
interchange, loop tiling), or hide the memory latency (e.g., loop unrolling,
software prefetching).

3.1 Techniques to Optimize Workloads

At the beginning of Section 2, it is mentioned that the array is scanned in column- 
major order while C uses row-major order. As a consequence, the program does 
not have any data locality. Loop interchange can be used to access consecutive 
rows of the array, resulting in spatial locality and a reduction in the total amount 
of memory accesses. Figure 5(a) gives the transformed program through loop 
interchange and its model parameters. 



    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            accu += A[i][j];

(a) Transformed loop.

    cpuBusy    memBusy    bothBusy
    18.93%     73.66%     7.41%

(b) Its workload.

Fig. 5. The impact of loop interchange.



Loop interchange effectively reduces the workload of both CPU and memory.
Since spatial locality is exploited, total memory work is reduced to 1/16
(only every 16th j-iteration generates a memory access). At the same time, the
sequential memory access pattern simplifies the address computations, and, as
a result, 20% of pure computations (in instructions) are eliminated. In
addition, instructions can be grouped more efficiently, and total CPU work (in
cycles) is reduced to 1/2. As a result, the transformed program speeds up by a
factor of 10.73.
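The factor 1/16 follows from the block size. A minimal sketch of the arithmetic, assuming 4-byte `float` elements and a scan that starts at a block boundary (both assumptions, as are the function names); the 64-byte L2 block size is the one stated for the target machine:

```c
#include <assert.h>
#include <stddef.h>

/* Number of array elements that share one cache block. */
size_t elements_per_block(size_t block_bytes, size_t element_bytes)
{
    assert(element_bytes > 0 && block_bytes >= element_bytes);
    return block_bytes / element_bytes;
}

/* Number of distinct blocks touched by a sequential scan of `count`
 * contiguous elements that starts at a block boundary. */
size_t blocks_touched(size_t count, size_t block_bytes, size_t element_bytes)
{
    size_t per_block = elements_per_block(block_bytes, element_bytes);
    return (count + per_block - 1) / per_block;   /* round up */
}
```

With 64-byte blocks and 4-byte elements, 16 consecutive elements share a block, so a row-major scan of 1024 elements touches 64 blocks instead of 1024.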

However, according to our model, the transformed program is not a good 
candidate for slowing down the CPU without noticeable performance impact; 
cpuBusy (18.93%) is too large to satisfy Condition (1). The source of the problem 
is that the work of the CPU and memory has little overlap. A careful examination 
of the execution trace reveals that since the resources (RUU units) are used up
very quickly, only a few j-iterations are issued, and then the fetch process is
stalled until data arrives from memory. Once these j-iterations are executed, a
few more j-iterations need to be executed before a new memory request is made.

Resources are used up very quickly because of the associated overheads in
every j-iteration. Loop unrolling may be able to alleviate the problem by
reducing the iteration overheads. Figure 6(a) gives the transformed program
with the j-loop unrolled 16 times. As a result, many more j-iterations (24 to
be exact) can be issued before the memory access is completed. New memory
requests are made before the old memory access is done. In other words, loop
unrolling has the effect of "implicit" data prefetching in our hypothetical
machine. This explains why cpuBusy is so small (0.67%).

    for (i=0; i<n; i++)
        for (j=0; j<n-15; j+=16) {
            accu += A[i][j];
            ...
            accu += A[i][j+15]; }

(a) Transformed loop.

    cpuBusy    memBusy    bothBusy
    0.67%      65.60%     33.73%

(b) Its workload.

Fig. 6. The impact of loop unrolling.

As discussed in the previous section, loop unrolling is designed to reduce CPU
workload but it has the side-effect of workload overlapping through "implicit"
data prefetching. Data prefetching can be done explicitly as well. The
intention is to prefetch needed data to avoid CPU interlocks. Figure 7(a)
shows a version of the transformed code.

    for (i=0; i<n; i++) {
        prefetch A[i][0]
        for (j=0; j<n-16; j+=16) {
            prefetch A[i][j+16]
            accu += A[i][j];
            ...
            accu += A[i][j+15]; } }

(a) Transformed loop.

    cpuBusy    memBusy    bothBusy
    0.67%      74.04%     25.29%

(b) Its workload.

Fig. 7. The impact of software data prefetch.



The explicitly data-prefetched program has a workload distribution similar to
that of the implicit version. Both allow the CPU speed to be reduced to 1/2
with a performance penalty of less than 1%. Figure 8 gives a summary of the
relative performance, possible slow-down factors, and potential energy
consumption for different versions of the optimized code.

The results show that even at the highest optimization levels, dynamic voltage
and frequency scaling can achieve energy savings of 35% over the fastest,
fully optimized version without any significant performance penalty (< 1%).
For the unoptimized case, the 70% energy savings are obtained without any
performance degradation.
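The energy column of Figure 8 can be reproduced from the relation E ∝ V²T given in its caption. The helper below is a small sketch with an illustrative name; the voltages (1.65 at full speed, 1.33 and 0.90 when scaled) and relative times are the ones listed in Figure 8.

```c
#include <assert.h>

/* Relative energy under E proportional to V^2 * T, normalized to the
 * unoptimized code running at the base voltage v_base.
 * rel_time is the execution time relative to that baseline. */
double relative_energy(double v_scaled, double v_base, double rel_time)
{
    assert(v_base > 0.0);
    double ratio = v_scaled / v_base;
    return ratio * ratio * rel_time;
}
```

For the unoptimized code at a slow-down of 10, `relative_energy(0.90, 1.65, 1.0)` is about 0.2975, i.e., the 29.75% entry; for the unrolled code at a slow-down of 2, `relative_energy(1.33, 1.65, 0.0643)` is about 0.0418.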

3.2 Relationship with Program Balance 

The concept of balance ($\beta$) has been defined in a number of studies
(e.g., [11,12,25,17,2]) as the ratio of the number of memory operations M to
the number of floating-point operations F, i.e., $\beta = M/F$. When applied
to a particular machine, $\beta$ can indicate either peak [11,12] or "average"
[25] machine performance. Similarly, every program (or loop) has its own
balance value $\beta_p$, which may take into account pipeline interlock [11].
Optimization techniques are proposed to restructure a program so that its
balance is closer to the underlying machine balance.

                   δ     T          f (MHz)   V (volts)   E
    unoptimized    1     100.00%    700       1.65        100.00%
                   10    100.00%    70        0.90        29.75%
    interchange    1     9.32%      700       1.65        9.32%
                   2     11.84%     350       1.33        7.69%
    unroll         1     6.37%      700       1.65        6.37%
                   2     6.43%      350       1.33        4.18%
    prefetch       1     6.37%      700       1.65        6.37%
                   2     6.44%      350       1.33        4.18%

Fig. 8. The impact of optimizations on performance and energy consumption: T is
the relative execution time performance over the original, unoptimized code;
$\delta$ is the slow-down factor; f and V are the corresponding adjustments if
running on an architecture with energy characteristics similar to the Crusoe
TM5400 processor; $E \propto V^2 T$ is the relative energy consumption over
the unoptimized code. The values for $\delta$ selected by our model are shown
in bold typeface.

In terms of our model, program balance is a relationship between the work of
the CPU and the memory, and their overlaps. When there is no overlap, slowing
down the CPU without significant performance penalties is not possible
(cpuBusy is too large). This situation is already illustrated in Figure 5. On
the other hand, when the work of CPU and memory overlaps almost perfectly, the
program can be either CPU-bound or memory-bound.

The loops in Figures 6 and 7 are considered memory-bound. Simulation results
show that 16 j-iterations take 32 cycles in total, assuming a perfect cache.
Since the memory latency is 100 cycles, it cannot be fully hidden in 16
j-iterations. On the other hand, CPU work is almost "embedded" in memory work.
Memory-boundness may create opportunities for reducing CPU clock speed with
negligible performance impact.

Some workload reduction transformations change the program from memory-bound
to CPU-bound. For instance, loop fusion combines the bodies of multiple
loops into a single loop. It not only reduces loop overheads and memory accesses,
but also increases CPU work relatively more than memory work per iteration.
According to our model, such CPU-bound programs cannot be slowed down without
significant performance penalty. On the other hand, memory accesses can
be slowed down without affecting the total performance.

4 Experiments 

All simulations are done through the SimpleScalar tool set [9], version 3.0a,
with memory hierarchy extensions [10]. SimpleScalar provides a cycle-accurate
simulation environment for modern out-of-order superscalar processors with
5-stage pipelines and a fairly accurate branch prediction mechanism.
Speculative execution is also supported. The processor core contains a
Register Update Unit (RUU) [30] which acts as a unified reorder buffer, issue
window, and physical register file. Separate banks of 32 integer and floating
point registers make up the architected register file and are only written on
commit.

The processor’s memory system employs a load/store queue (LSQ). It sup- 
ports multi-level cache hierarchies and non-blocking caches. The extensions add a 
one-level page table, finite miss status holding registers (MSHRs) [23] , and simu- 
lation of bus contention at all levels, but not simulation of page hits, precharging 
overhead, refresh cycles, or bank contention. 



4.1 Simulation Parameters 

The baseline processor core includes the following: a four-way issue
superscalar processor with a 64-entry issue window for both integer and
floating point operations, a 32-entry load/store queue, commit bandwidth of
four instructions per cycle, a 256-entry return address stack, and an extra
branch misprediction latency of 3 cycles.

In the memory system, we use separate 32KB, direct-mapped level-one
instruction and data caches, with a 512KB, direct-mapped, unified level-two
cache. The L1 caches have 32-byte blocks, and the L2 cache has 64-byte blocks.
The L1/L2 bus is 256 bits wide, requires one cycle for arbitration, and runs at
the same speed as the processor core. Each cache contains eight MSHRs with four
combining targets per MSHR. The L2/memory bus is 128 bits wide, requires
one bus cycle for arbitration, and runs at 1/4 of the processor core speed.
Figure 9 summarizes the simulation parameters used in the paper.

4.2 Dynamic Voltage Scaling Capability 

The current implementation of the SimpleScalar tool set does not support
dynamic frequency scaling. Our simulation is done by multiplying the total
number of CPU cycles by the slow-down factor. For example, if the clock speed
of the baseline processor is reduced by half, the latencies of the memory and
L2/bus, expressed in CPU cycles, are reduced by half. The total number of CPU
cycles is then multiplied by two to get the absolute performance. We are in
the process of extending the SimpleScalar instruction set to support dynamic
speed setting.
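The emulation described above amounts to two small conversions. The sketch below is illustrative only: the function names are hypothetical, and the divisibility check is an added assumption that mirrors Condition (3) rather than something the simulator enforces.

```c
#include <assert.h>

/* A latency that is fixed in absolute time shrinks by delta when it is
 * expressed in cycles of a CPU clock that is delta times slower. */
int scaled_latency(int latency_cycles, int delta)
{
    assert(delta >= 1 && latency_cycles % delta == 0);
    return latency_cycles / delta;
}

/* Convert cycles counted at the slowed clock back to time measured in
 * cycles of the original, full-speed clock. */
long absolute_time(long simulated_cycles, int delta)
{
    assert(delta >= 1);
    return simulated_cycles * (long)delta;
}
```

For a slow-down factor of 2, the 100-cycle memory latency is simulated as 50 cycles, and a run that takes 1000 simulated cycles corresponds to 2000 cycles of the original clock.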

The energy estimation of the program is not yet incorporated into our version 
of the SimpleScalar simulator. In the future we will implement energy accounting 
as suggested by Wattch [5], which is based on SimpleScalar version 3.0 and 
publicly available. 

4.3 Experimental Results 

    Simulation parameters   Value
    fetch width             4 instructions/cycle
    decode width            4 instructions/cycle
    issue width             4 instructions/cycle, out-of-order
    commit width            4 instructions/cycle
    RUU size                64 instructions
    LSQ size                32 instructions
    FUs                     4 intALUs, 1 intMULT, 4 fpALUs, 1 fpMULT, 2 memports
    branch predictor        gshare, 17-bit wide history
    L1 D-cache              32KB, 1024-set, direct-mapped, 32-byte blocks, LRU,
                            1-cycle hit, 8 MSHRs, 4 targets
    L1 I-cache              as above
    L1/L2 bus               256-bit wide, 1-cycle access, 1-cycle arbitration
    L2 cache                512KB, 8192-set, direct-mapped, 64-byte blocks, LRU,
                            10-cycle hit, 8 MSHRs, 4 targets
    L2/mem bus              128-bit wide, 4-cycle access, 1-cycle arbitration
    memory                  100-cycle hit, single bank, 64-byte/access
    TLBs                    128-entry, 4096-byte page
    compiler                gcc 2.7.2.3 -O3

Fig. 9. System simulation parameters.

In addition to measurements for the accumulator loop shown in Figure 3(a)
with results shown in Figure 8, we evaluated our model for the two BLAS1
kernels sdot and saxpy. Both codes were optimized by hand using advanced
transformations such as loop unrolling, loop splitting, software pipelining and
software prefetching. The resulting code versions were compiled using gcc -O3.
The following table lists the measured values for the three model parameters
cpuBusy, memBusy, and bothBusy, and the resulting slow-down factor $\delta$ as
computed by our model.

                sdot      saxpy
    cpuBusy     0.19%     0.92%
    memBusy     73.53%    85.88%
    bothBusy    26.27%    13.20%
    δ           2         2



The performance of the optimized versions of the two kernels under different
slow-down factors is reported in Figure 10. As in the case of the optimized
accumulator kernel, the graphs show a performance behavior that is nearly
constant for small values of $\delta$, until the performance starts to degrade
close to linearly with the slow-down factor. Figure 11 shows the performance
characteristics of the optimized codes and the computed CPU slow-down factor
$\delta$. In both cases, the model determines $\delta = 2$. For sdot, the
resulting energy saving is 33% with a 3% performance penalty relative to the
optimized version running at full CPU speed. For saxpy, these figures are 34%
energy savings and 1.9% performance penalty. For the unoptimized code
versions, the model was not able to determine slow-down factors greater than 1.

[Plots (a) and (b): performance versus slow-down factor.]

(a) Highly optimized sdot: loop splitting and unrolling, software pipelining &
prefetching.
(b) Highly optimized saxpy: loop unrolling, software pipelining & prefetching.

Fig. 10. Two simple BLAS1 kernels and their optimized performance under different
CPU clock frequencies. The horizontal lines indicate the threshold for a 1% performance
degradation.

                          δ    T          f      V       E
             unoptimized  1    100.00%    700    1.65    100.00%
    sdot     optimized    1    75.18%     700    1.65    75.18%
                          2    77.45%     350    1.33    50.32%
    saxpy    optimized    1    93.18%     700    1.65    93.18%
                          2    94.92%     350    1.33    61.67%

Fig. 11. The impact of optimizations and computed CPU slow-down on performance
and energy consumption for sdot and saxpy. The values for $\delta$ selected by
our model are shown in bold typeface.
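The savings and penalties quoted for sdot and saxpy are ratios of entries in Figure 11, each taken relative to the optimized code at full speed. A minimal sketch with illustrative function names:

```c
#include <assert.h>

/* Fraction of energy saved by the scaled run relative to the full-speed run. */
double energy_saving(double e_full, double e_scaled)
{
    assert(e_full > 0.0);
    return 1.0 - e_scaled / e_full;
}

/* Fractional increase in execution time caused by the slow-down. */
double performance_penalty(double t_full, double t_scaled)
{
    assert(t_full > 0.0);
    return t_scaled / t_full - 1.0;
}
```

For sdot, `energy_saving(75.18, 50.32)` is about 0.33 and `performance_penalty(75.18, 77.45)` about 0.03; for saxpy the same calls on 93.18/61.67 and 93.18/94.92 give about 0.34 and 0.019.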



The experiments show that our model was not able to precisely predict the 
performance penalty as a result of the CPU slow-down. In our measured cases, 
this imprecision has no significant impact. However, we are currently investi- 
gating refinements to our model that will include the basic computation and 
memory access patterns of program regions, in particular loop nests. 



5 Related Work 

Extensive research on optimizing compilers has been carried out in the last few 
years [35,26], mostly execution time oriented and for high-performance proces- 
sors. Since battery-powered mobile computers are getting more popular, there 
is a growing interest in optimizing software for low power. 

In general, most performance-oriented transformations will also improve the
overall energy consumption of an application [32]. However, recent results
indicate that optimization techniques such as loop tiling and data
transformations may increase the energy usage of the datapath while reducing
memory system energy, which leads to challenging trade-offs between prolonging
battery life and limiting dissipated energy within a package [19,20]. In the
context of loop tiling, it has been found that the best tile size for the
least energy consumed is different from that for the best performance [19].

Other recent research has shown that a power-aware register relabeling
algorithm can reduce the overall power consumption of the register file by up
to 12% without any performance impact [36]. For handheld battery-powered
devices, compiler-directed remote task execution can be an effective technique
to save energy on the mobile device [22].

5.1 Energy Models 

Just as performance-oriented compiler optimizations need performance models 
to evaluate various coding schemes, power-aware optimizations need power mod- 
els. 

Tiwari et al. [32] proposed to assign each instruction an energy cost and to
estimate the total energy consumption of a program based on its instructions.
Along the same lines, [4] proposed a functional decomposition of the
activities accomplished by a generic microprocessor and exhibited
generalization capabilities. In [29], a function-level power estimation
methodology is proposed. With this method, microprocessor vendors can provide
users with a "power data bank" without releasing details of the core, to help
users get early power estimates and eventually guide power optimization.

A lot of attention has been given to the energy consumption of the memory
subsystem. For example, [31,15,18] all proposed analytical models for the
memory subsystem with varying precision. A more precise model may take into
account the run-time access statistics, which can be derived analytically (as
many classical optimizations already do) or through simulation [1].
Analytical energy models for buses have also been proposed recently [37]. In
addition, many simulators were built in the past few years to more precisely
capture the energy consumption of the processor core. Wattch [5] and
SimplePower [36] are two such examples. More details can be found in [3].

5.2 Dynamic Voltage Scaling 

Recently, methods have been developed to dynamically control the supply
voltage to adapt to the program's execution behavior. For example, the
operating frequency can be set to the lowest value possible for the program
execution, and the voltage can be varied dynamically to match. This approach
is used by Transmeta [33] and researchers at the University of California at
Berkeley [8]. Another approach is to dynamically adjust a transistor's
threshold voltage. The chip can also be divided into blocks, with independent
supply voltage control for each block. If a block is not in use, its supply
can be cut to save energy.

Dynamic voltage scaling does not come without overheads. For example, for
a large voltage change, the transition can take as long as 520 μs and consume
130 μJ of energy [7]. Such a long transition time suggests coarse-grained
speed control and gradual speed settings.

6 Conclusion and Future Work

Dynamic frequency and voltage scaling is an effective way to reduce power
dissipation and energy consumption of memory-bound loops. Choosing a maximal
CPU slow-down factor is a difficult problem if deadlines have to be met. This
paper discussed a simple performance model that allows the selection of
efficient slow-down factors. Experiments based on three numerical kernels and
a simulator for an advanced superscalar architecture indicate the
effectiveness of the new model. Assuming the power characteristics of
Transmeta's Crusoe processor, the resulting energy savings of our compilation
strategy are in the range of 33% to 70%. The results show that even for highly
optimized code there is still significant room for additional energy savings
by applying our power optimization strategy. More experiments will be needed
to further validate our proposed compilation strategy.

The implementation of the proposed models and compilation techniques is
currently underway. In addition, we are extending our model to deal with
computation-bound loops that allow energy savings by slowing down the memory
subsystem. Algorithms for partitioning the program into regions of fixed
frequency and voltage assignments, with voltage and frequency transitions
between them, need to be developed. In this paper, we have concentrated on
single loop nests with single voltage and frequency assignments. Future work
will address global, whole-program solutions that will consider the execution
time and energy overheads of voltage and frequency scaling.

Acknowledgements. The authors wish to thank Professor Doug Burger from
the University of Texas at Austin for providing the SimpleScalar tool set with
his memory extensions.

Abstract. Reducing the supply voltage to reduce dynamic power
consumption in CMOS devices will inadvertently lead to an exponential
increase in leakage power dissipation. In this work we explore an
architectural idea to reduce leakage power in data caches. Previous work
has shown that cache frames are "dead" for a significant fraction of time
[14]. We exploit this observation to turn off cache lines that are not
likely to be accessed any more. Our method is simple: if a cache line is not
accessed within a fixed interval (called the decay interval) we turn off its
supply voltage using a gated-Vdd technique introduced previously [12]. We
study the effect of cache-line decay on both power consumption and
performance. We find that it is possible with cache-line decay to build
larger caches that dissipate less leakage power than smaller caches while
yielding equal or better performance (fewer misses). In addition, because
our method can dynamically trade performance for leakage power, it can be
adjusted according to the requirements of the application and/or the
environment.

1 Introduction 

Striving for low-power, high-performance CMOS devices drives supply voltage
(Vdd) to ever lower levels [8]. To maintain performance, a reduction in Vdd
necessitates a reduction in threshold voltage (Vt), which in turn increases
leakage power dissipation exponentially [1,2,6]. Since chip transistor counts
continue to increase, and every transistor that is powered on leaks
irrespective of its switching activity, leakage power is expected to become a
significant factor in the total power dissipation of a chip [2]. Given the
current trends [1,13], the leakage power dissipated by a chip could equal its
dynamic power within three processor generations.

Although the leakage power of an SRAM transistor can be lower than the leakage
power of high-speed logic transistors [5], on-chip caches can still contribute a
significant percentage of a chip's leakage power for two reasons. First, a
large fraction of a chip's transistors are in the cache memory. Second, memory
fabric cells are composed of low fan-in gates, namely cross-coupled inverters
with only a single leaking transistor to a power rail. In contrast,
significant parts of the logic circuits typically consist of higher fan-in
gates with more transistors connected in series to a power rail (stacked
transistors [6]). Given that the leakage power dissipation is becoming
significant, circuit-level or micro-architectural solutions for on-chip caches
are necessary to deal with the whole problem.

One solution for reducing leakage power is to switch off power to unused devices.
Powell, Yang, Falsafi, Roy, and Vijaykumar recently proposed a micro-architectural
technique called DRI cache and a circuit-level technique called gated-Vdd to
switch off (or gate Vdd to) large blocks of the instruction cache [12,15].
Motivated by their approach, we extend it by applying a similar idea to data
caches, but instead of large portions of the cache, we propose switching off
individual cache lines.

Our proposed scheme, called Cache Decay, consists of invalidating and turning off 
power to cache lines that have not been accessed for a certain interval, called the 
Decay Interval. When a powered down cache line is accessed, a cache miss is 
incurred while the line is switched back on and data are fetched from the next level of 
the memory hierarchy. Other cache-line aging techniques have been used in other 
contexts, for example for Dynamic Self Invalidation [9,10], and for managing group 
associative caches [11]. In contrast to previous work, we propose very simple, low- 
overhead implementations since our main goal is to reduce power consumption. 
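The behavior described above can be modeled for a single cache line. The sketch below is a deliberately naive illustration and not the implementation proposed in this paper: it uses a per-line cycle counter, ignores tag matching, and treats the refill from the next memory level as instantaneous. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     powered;      /* false once the line has been switched off */
    bool     valid;        /* false once the contents are invalidated   */
    uint32_t idle_cycles;  /* cycles elapsed since the last access      */
} cache_line;

/* Advance time by one cycle; the line decays (is invalidated and powered
 * down) once it has been idle for decay_interval cycles. */
void line_tick(cache_line *line, uint32_t decay_interval)
{
    assert(line != 0);
    if (!line->powered)
        return;
    if (++line->idle_cycles >= decay_interval) {
        line->valid = false;
        line->powered = false;
    }
}

/* Access the line; returns true on a hit. A decayed line misses, is
 * switched back on, and is refilled. */
bool line_access(cache_line *line)
{
    assert(line != 0);
    bool hit = line->powered && line->valid;
    line->powered = true;
    line->valid = true;
    line->idle_cycles = 0;
    return hit;
}
```

With a decay interval of 4 cycles, an access after 3 idle cycles still hits, while an access after 4 idle cycles finds the line switched off and incurs a miss.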

We studied the cache access patterns for a set of SPEC95 benchmarks; they display a 
high degree of temporal locality (see Section 2), and indicate that turning off cache 
blocks that have not been accessed for an appropriate period of time will not 
significantly increase miss rates. We study the effect of varying the decay interval on 
a variety of benchmarks. For very small decay intervals (thousands of cycles), the 
application can suffer a larger number of cache misses; for very large intervals 
(hundreds of thousands of cycles), very few cache blocks may decay in time to be 
powered down. However, we find that for a wide range of decay intervals, the cache 
decay technique is successful in switching off large portions of the on-chip data cache 
without significantly affecting miss rates. 

Contributions of this paper are as follows: 

• We propose cache decay as a mechanism to turn off unused lines in the cache. 

• We describe in detail a digital implementation of the cache decay mechanism 
and discuss an analog implementation. 

• We study the effects of cache decay on power and performance using
execution-driven simulation and SPEC95 programs. Our results demonstrate the
effectiveness of the cache decay scheme. In particular, we show that an L1
data cache without cache decay can be replaced by an L1 cache of twice the
size with the same performance but up to 56% less leakage power.

Organization of this Paper. In Section 2 we discuss cache decay and its
implementations. Section 3 presents details of our experimental methodology
and Section 4 presents the results of our experiments. We conclude in
Section 5.

2 Cache Decay 

Recent work by Powell et al. [12] showed that powering down sections of the instruc- 
tion cache and resizing it significantly reduces leakage-power. Motivated by this ap- 
proach we examined switching-off parts of the data cache but at a much finer 
granularity (cache-line granularity) and without resizing. We rely on the fact that 
many cache frames are under-utilized and therefore can be turned off without impact 
on performance. Evidence of this comes from two papers: 

• Wood, Hill, and Kessler showed that the miss rate of unknown references (cold 
misses) in a trace-driven simulation with unknown initial conditions is much 
higher than the steady-state miss rate (e.g., 0.40 vs. 0.02) [14]. The high
cold-miss rate is simply the ratio of time a cache frame is dead (i.e., the time
between last hit and replacement). 

• In a paper examining cache efficiency, Burger, Goodman, and Kagi showed
that most of the data in a cache will not be used in the future (either will be 
overwritten or will not be accessed at all) [3]. 

Although discovering dead data in a cache is not a trivial matter, we hypothesized 
that a simple technique could actually capture some of the benefit. Our approach 
attempts to switch off least recently used cache lines assuming that these will be 
unlikely to be accessed in the future. To substantiate our hypothesis we profiled the 
execution of SPEC95 programs. Figure 1 shows the distributions of access 
intervals — intervals between consecutive accesses to the same cache line — for three 
programs.¹ The horizontal axis of the graphs is the access interval (in hundreds of
cycles) and the vertical axis is the percentage of the accesses corresponding to an 
access interval (i.e., distance from the previous access). The last point in the 
horizontal axis represents the tail of the distribution which is quite small in gcc and 
vortex but sizable in compress. 

[Figure 1: three panels (gcc, compress, vortex); horizontal axis: time (x100 cycles).]

Fig. 1. Access intervals for gcc, compress and vortex 



Since most consecutive accesses to the same cache-line are spaced closely in time — 
temporal locality — a cache line that has not been accessed for some time either will 
not be accessed again or it is one of the few cache lines that will be accessed very far 
into the future. Therefore, we propose to maintain power to cache lines as long as 
they are accessed within some predefined time interval (decay interval). We have
identified digital and analog implementations to detect the passage of a decay interval 
from the last access to each cache line. We present these implementations in sections 
2.1 and 2.2. 

Regardless of the implementation, cache-line decay will increase the miss rate of the 
cache: a few lines will be powered-off before they are accessed. However, as we will 
show in Section 4, the miss rate of a decay cache is still lower than that of a smaller cache
whose size matches the average powered size of the decay cache. Another way to 
view the decay cache is from a leakage power efficiency perspective: the average 
powered size of a decay cache is smaller than a cache of equal miss rate. 



¹ Other SPEC95 programs produce similar results.

2.1 Digital Implementation

One way to represent recency of a cache line’s access is via a digital counter. This 
counter is cleared on each access to the cache line and incremented periodically at 
fixed time intervals. Once the counter reaches its maximum count it saturates and 
switches off power (or ground) to the corresponding cache line. 

A casual interpretation of the graphs in Figure 1 suggests decay intervals of tens or 
hundreds of thousands of cycles. Because the number of cycles needed for a 
reasonable decay interval makes it impractical for the counters to count cycles (too 
many bits would be required) it is necessary for the counters to “tick” at a much 
coarser level, for example every few thousand cycles. A global cycle counter can be 
set up to provide the ticks for smaller cache-line counters (as shown in Figure 2). 
Simulations show that a two-bit counter per cache line provides sufficient resolution. 
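The arrangement of a global cycle counter driving small per-line counters can be sketched behaviorally as follows. This is a minimal illustration under stated assumptions, not the circuit itself: the class name `DecayCounters`, its method names, and the default tick period of 4096 cycles are hypothetical choices made for the example.

```python
class DecayCounters:
    """Behavioral sketch: a global cycle counter ticks small per-line counters."""

    def __init__(self, num_lines, tick_period=4096, counter_bits=2):
        self.tick_period = tick_period            # cycles per tick (assumed value)
        self.max_count = (1 << counter_bits) - 1  # saturation value (3 for 2 bits)
        self.global_count = 0
        self.counters = [0] * num_lines           # per-line saturating counters
        self.powered = [True] * num_lines

    def access(self, line):
        """An access clears the line's counter and keeps (or turns) the line on."""
        self.counters[line] = 0
        self.powered[line] = True

    def cycle(self):
        """Advance one clock cycle; emit a tick when the global counter wraps."""
        self.global_count += 1
        if self.global_count == self.tick_period:
            self.global_count = 0
            self._tick()

    def _tick(self):
        """Increment every unsaturated line counter; saturated lines are switched off."""
        for i, count in enumerate(self.counters):
            if count < self.max_count:
                self.counters[i] = count + 1
            if self.counters[i] == self.max_count:
                self.powered[i] = False
```

With two-bit counters, a line that is not accessed is switched off on the third tick after its last access, so the per-line state stays tiny while the decay interval is set entirely by the global tick period.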



[Figure 2: an N-bit global counter generates the tick pulse (T) for the 2-bit decay counters attached to each cache line (valid bit, cache-line data/tag).]

Fig. 2. High-level view of the digital implementation with 2-bit, Gray-code, saturating counters

Global Counter. To save power, the global counter can be implemented as a binary 
ripple counter. An additional latch holds a maximum count value which is compared 
to the counter. When the counter reaches the maximum value, it is reset and a 1-
clock-cycle T signal is generated. This circuit can be implemented with 40N + 20
transistors, where N is the number of bits required. The maximum-count latch is non- 
switching and doesn't contribute to dynamic power dissipation. On average, only two 
bits of the counter and comparator — less than 80 transistors — will change state and 
dissipate dynamic power each clock cycle. 
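The claim that only about two counter bits change state per clock cycle follows from binary counting: bit k toggles once every 2^k increments, so the average number of toggles per increment is 1 + 1/2 + 1/4 + ..., which stays below 2. The short check below is an illustrative sketch; the function name `average_bit_flips` is an assumption of this example, and it models a counter that wraps at 2^N rather than at a programmable maximum.

```python
def average_bit_flips(n_bits):
    """Average number of bits that toggle per increment of an n-bit binary counter.

    Counts toggles over one full period, including the wrap-around to zero.
    """
    period = 1 << n_bits
    total = 0
    for value in range(period):
        nxt = (value + 1) % period
        total += bin(value ^ nxt).count("1")  # bits that differ between states
    return total / period
```

For a 12-bit counter this gives 2 - 2^-11, just under two bits per cycle. If the per-bit cost is read as roughly 40 transistors, two switching bits correspond to the "less than 80 transistors" figure quoted above.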

Cache-Line Counters. To minimize state transitions in these counters — and thus 
minimize dynamic power consumption — we use Gray coding so only one bit changes 
state at any time. Furthermore, to simplify the counters and minimize transistor count 
we chose to implement them asynchronously. Each cache line contains circuitry to 
implement the state machine depicted in Figure 3. 




[Figure 3: state diagram for the 2-bit Gray-code counter; state bits S1, S0; inputs T (tick) and WRD (access).]

$\text{Next } S_0 = S_0 \cdot \overline{T} \cdot \overline{WRD} + \overline{S_1} \cdot T \cdot \overline{WRD}$
$\text{Next } S_1 = S_0 \cdot T \cdot \overline{WRD} + S_1 \cdot \overline{WRD}$

Fig. 3. Two-bit (S0, S1), saturating, Gray-code counter with two inputs (WRD and T)

There are two inputs to the counter circuit:

1. A global time signal, T, is a periodic pulse to indicate the passage of time and it
is supplied by the (synchronous) global cycle counter. T is a well behaved 
digital signal whose period may be adjusted externally to provide different 
decay intervals appropriate for different programs. 

2. The second state machine input is the cache line access signal, WRD, which is 
decoded from the address and is the same signal used to select a particular row 
within the cache memory (e.g., the WORD-LINE signal). 

State transitions occur asynchronously on changes of the two input signals, T and 
WRD. But since T and WRD are well behaved signals, there are no meta-stability 
problems. The only output is the cache-line switch state, POOFF (POwer OFF). 
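The behavior of the per-line state machine can be illustrated with an event-level sketch. It assumes the Gray sequence 00, 01, 11, 10 for (S1, S0) with the last state saturated and asserting POOFF; this encoding and the function names are assumptions of the example, and the asynchronous circuit timing is not modeled.

```python
# Assumed Gray-code sequence for (S1, S0); the final state is the saturated one.
GRAY_SEQUENCE = [(0, 0), (0, 1), (1, 1), (1, 0)]


def next_state(state, event):
    """Return the next (S1, S0) state for event 'T' (tick) or 'WRD' (access)."""
    if event == "WRD":
        return GRAY_SEQUENCE[0]                      # any access resets the counter
    if event == "T":
        idx = GRAY_SEQUENCE.index(state)
        last = len(GRAY_SEQUENCE) - 1
        return GRAY_SEQUENCE[min(idx + 1, last)]     # advance, saturating at the end
    raise ValueError(f"unknown event: {event}")


def pooff(state):
    """POOFF is asserted only in the saturated state."""
    return state == GRAY_SEQUENCE[-1]


def bits_changed(a, b):
    """Number of state bits that differ between two states."""
    return sum(x != y for x, y in zip(a, b))
```

Every T transition in this sequence changes at most one state bit, which is the property that motivates Gray coding; a reset from state 11 is the only transition that changes two bits.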

Implementation Details. Switching-off power to a cache line has important 
implications for the rest of the cache circuitry. In particular, the first access to a pow- 
ered-off cache line should: 

1. result in a miss (since data and tag might be corrupted without power) 

2. reset the counter and restore power to the cache line (i.e., restart the decay 
mechanism as per Figure 3) 

3. be delayed an appropriate amount of time until the cache-line circuits stabilize 
after power is restored. 

To satisfy these requirements we use the Valid bit of the cache line as part of the 
decay mechanism (Figure 4). First, the valid bit is always powered. Second, we add a 
reset capability to the valid bit so it can be reset to 0 (invalid) by the decay 
mechanism. The POOFF signal clears the valid bit. Thus the first access to a
powered-off cache line always results in a miss regardless of the contents of the tag. Since
satisfying this miss from the lower memory hierarchy is the only way to restore the 
valid bit, a newly powered cache line will have enough time to stabilize. In addition, 
no other access (to this cache line) can read the possibly corrupted data in the interim. 
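The role of the valid bit can be made concrete with a small behavioral sketch. The class `DecayLine` and its fields are hypothetical names introduced for illustration; refill latency and circuit stabilization are not modeled, only the ordering of events described above.

```python
class DecayLine:
    """Behavioral sketch of one cache line with an always-powered valid bit."""

    def __init__(self):
        self.valid = False    # always-powered valid bit
        self.powered = False  # switched supply for data and tag bits
        self.tag = None
        self.counter = 0      # decay counter (tick handling not modeled here)

    def decay(self):
        """POOFF asserted: cut power to data/tag and clear the valid bit."""
        self.powered = False
        self.valid = False

    def access(self, tag):
        """Return True on a hit. A miss restores power and refills the line."""
        self.counter = 0                  # every access restarts the decay mechanism
        if self.valid and self.tag == tag:
            return True
        # Miss path: power is restored first; the valid bit is set only once the
        # refill from the next memory level completes, so stale contents of a
        # previously powered-off line can never be read as a hit.
        self.powered = True
        self.tag = tag
        self.valid = True
        return False
```

Because the hit check consults the valid bit before the tag, the contents of a powered-off line are never trusted, whatever value the unpowered tag cells happen to hold.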



[Figure 4: block diagram. Labels: counter, valid bit, line bits (data + tag); always powered; switched power.]

Fig. 4. Cache-line power control 



2.2 Analog Implementation 

An alternative way to represent the recency of a cache line’s access is via charge 
stored on a capacitor (Figure 5). Each time the cache line is accessed, the capacitor is 
grounded. In the common case of a frequently accessed cache-line the capacitor will 
be discharged. Over time, the capacitor is charged through a resistor connected to
Vdd. Once the charge reaches a sufficiently high level, a voltage comparator detects it,
asserts the POOFF signal, and switches off power (or ground) to the corresponding
cache line (data bits and tag bits). 

Although the RC time constant cannot be changed (it is determined by the fabricated 
size of the capacitor and resistor) the bias of the voltage comparator can be adjusted 
to suit different programs' temporal access patterns. An analog implementation is
inherently noise sensitive and can change state asynchronously with the remainder of 
the digital circuitry. Some method of synchronously sampling the voltage comparator 
must be employed to avoid meta-stability. Since an analog implementation can be 
fabricated to mimic the digital implementation, for the rest of this paper we focus on 
the latter. 
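To see how the comparator bias sets the decay interval for a fixed RC, consider an ideal RC charging curve, V(t) = Vdd (1 - exp(-t / RC)), starting from a fully discharged capacitor. Solving for the time at which V reaches the comparator reference gives t = -RC ln(1 - Vref / Vdd). The sketch below is only an idealized illustration; the function name and the component values in the usage note are assumptions, and leakage and noise are ignored.

```python
import math


def decay_time(r_ohms, c_farads, v_ref, v_dd):
    """Time for an ideal RC node, charging from 0 V toward v_dd, to reach v_ref."""
    if not 0.0 < v_ref < v_dd:
        raise ValueError("v_ref must lie strictly between 0 and v_dd")
    return -r_ohms * c_farads * math.log(1.0 - v_ref / v_dd)
```

For example, with hypothetical values R = 1 MΩ and C = 1 nF (RC = 1 ms), raising the reference from 0.5 Vdd to 0.9 Vdd lengthens the interval from about 0.69 RC to about 2.3 RC, so the bias alone gives a useful tuning range without changing the fabricated components.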



[Figure 5: circuit schematic; supply rail labeled Vdd.]




Fig. 5. Analog implementation. Switch-off cache line on capacitor charge. 



3 Methodology 

In this work we use execution-driven simulation to study the run-time behaviour of 
decay caches (e.g., miss rate and ratio of powered-off cache-lines). We then use the 
simulation results to model leakage power consumption. The results of the power 
models allow us to make comparisons among cache configurations with and without 
decay mechanisms. In Section 3.1 we describe in more detail our experimental setup 
and in Section 3.2 we discuss power consumption models. 

3.1 Experimental Setup 

To evaluate the effectiveness of cache-line decay we use seven SPEC95 benchmarks. 
We present detailed results for three programs exhibiting medium (gcc), high (com- 
press), and low (vortex) miss rates. We simulated the execution of these benchmarks 
for 500 million instructions using SPEC95 reference inputs on the simulator from
the SimpleScalar 2.0 tool set [4]. We use the detailed, out-of-order
superscalar processor simulator (with non-blocking caches) to run the benchmarks
since we must accurately account for time differences in cache accesses. Simulator
parameters are shown in Table 1. In our studies we concentrate on L1 caches. We
chose to examine small cache sizes from 8Kbytes to 32Kbytes because SPEC95
programs do not stress larger caches (we want to be conservative since our methods 
would work much better in larger, less utilized caches). In small caches virtually all 
cache lines were accessed during the execution of the programs. 

The simulator was modified to switch off cache lines (both tag and data) according to 
the cache decay schemes described in Section 2. Every access to the cache block 
restarts the decay mechanism for that line. We use the simulator to measure various
statistics such as the number of cache misses, the fraction of the cache that is powered
up, the number of times the decay counters are incremented and change state, etc. 

Table 1. Simulator configuration parameters 



Parameter                         Value
Physical Registers                64-INT, 64-FP
Fetch, Decode, Issue, Commit      4 instructions per cycle
Functional Units                  4 IntALU, 1-IntMult/Div, 2 FP ALU, 1-FPMult/Div, 2 MemPorts
Branch Predictor                  Combined, Bimodal 4K table, 2-Level 1K table, 10-bit history, 4K chooser
BTB                               1024-entry, 2-way
Return Address Stack              32-entry
Mispredict penalty                7 cycles
L1 Dcache Size                    8K, 16K and 32K, 2-way, 16B blocks
L1 Icache Size                    64K, 2-way, 32B blocks
L2 (omitted in some experiments)  Unified, 256K, 8-way LRU, 32B blocks, 12-cycle latency
Memory                            100 cycles
TLB Size                          128-entry, 30-cycle miss penalty



3.2 Power Computation 

The additional dynamic power dissipated due to the decay circuitry is proportional to 
the product of its load capacitance and the switching activity of its transistors. For the 
implementation described in Section 2, less than 110 of its transistors switch on 
average every cycle. The entire decay circuitry involves a very small number of 
transistors: a few hundred for the global counter plus under 30 transistors per local 
cache line counter. All local counters change value with every T pulse. However, this 
happens at very coarse intervals (equal to the period of the global counter). Resetting 
a local counter with an access to a cache line is not a cause of concern either. If the 
cache line is heavily accessed the counter has no opportunity to change from its initial 
value so resetting it does not expend any dynamic power (none of the counter’s 
transistors switch). The cases where power is consumed are accesses to cache lines 
that have been idle for at least one period of the global counter. Our simulation results 
indicate that over all the 2-bit counters used in our scheme, there is less than one bit 
transition per cycle on average. Thus, the dynamic power dissipation of the decay 
circuitry is negligible compared to the dynamic power dissipated in the remainder of 
the chip, which presumably contains millions of transistors. We therefore compute in 
detail only the leakage power of the cache with and without decay. 

We assume a fixed threshold voltage in our experiments. The leakage power for the 
cache is therefore assumed to be proportional to the total number of cache lines that 
are powered-on in the cache. We compare the leakage power of the original cache 
with that of the additional decay circuitry by assuming it to be proportional to the 
total number of transistors in both those subblocks. We compute the total number of 
transistors in the original cache (we include both data bits and tag bits) and the 
additional decay circuitry from the transistor counts in Section 2.

4 Results

We start our evaluation with the behaviour of the decay scheme using full counters 
per cache-line (Section 4.1). This is an idealized scheme — clearly impractical — but 
shows the behaviour of various cache configurations with precise control of the decay 
interval. Subsequently, Section 4.2 presents detailed results for a realistic digital 
implementation with a global counter and 2-bit counters per cache-line. In the same 
section we show power and performance benefits of the decay caches over standard 
caches half their size. Section 4.3 shows an alternative view of cache decay 
comparing equal size decay and standard caches for seven SPEC95 programs. 



4.1 Sensitivity to Decay Interval, Cache Size, and Block Size 

Results in this section were obtained with precise control over the decay interval: a 
cache line is switched-off when a specific number of cycles has passed since it was 
last accessed. We achieve this by simulating full counters (as many bits as needed) 
per cache line. Varying the decay interval, we measure miss rate and active ratio for a 
given cache configuration. We define active ratio as the average part of the cache that 
is switched on per cycle during the execution of a program. Figure 6A (left side 
graphs) shows the graphs for three programs using 16K, 16-byte-block caches. An 
infinite interval (“inf.” on the x axis) represents the standard cache where nothing is
turned off. In this case we have the minimum miss rate and the maximum active ratio.
For all three programs the active ratio is 100%, meaning that the whole cache is
accessed. Decreasing the decay interval to 512K and 64K cycles increases miss rate
slightly but decreases the active ratio significantly. However, further decreasing the
decay interval to 8K and 1K cycles starts to show dramatic increases in miss rate.
This result also agrees intuitively with data presented in Section 2: as we begin to 
switch-off valuable cache lines (frequently and heavily accessed), miss rate becomes 
increasingly worse. 
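The two metrics used in this section can be illustrated on a toy trace. The sketch below assumes a direct-mapped cache for brevity (the simulated caches are 2-way) and treats lines that are never touched as unpowered, so its numbers are not comparable to the reported results; the function name and trace format are assumptions of the example.

```python
def simulate_decay(trace, num_lines, block_bytes, decay_interval, total_cycles):
    """Idealized per-line decay with a precise interval (direct-mapped sketch).

    trace: list of (cycle, address) pairs sorted by cycle.
    Returns (miss_rate, active_ratio).
    """
    tags = [None] * num_lines
    last = [None] * num_lines        # cycle of the most recent access per line
    misses = 0
    accesses = 0
    powered_cycles = 0               # total line-cycles spent powered on

    for cycle, addr in trace:
        block = addr // block_bytes
        idx, tag = block % num_lines, block // num_lines
        accesses += 1
        if last[idx] is not None:
            # The line stayed on until this access or until it decayed.
            powered_cycles += min(cycle - last[idx], decay_interval)
        decayed = last[idx] is None or cycle - last[idx] > decay_interval
        if decayed or tags[idx] != tag:
            misses += 1
        tags[idx] = tag
        last[idx] = cycle

    for t in last:                   # account for the tail after the final access
        if t is not None:
            powered_cycles += min(total_cycles - t, decay_interval)

    return misses / accesses, powered_cycles / (num_lines * total_cycles)
```

Passing `float("inf")` as the decay interval reproduces the no-decay case for the lines that are touched, which makes the trade-off visible: shortening the interval lowers the active ratio and, once it drops below typical reuse distances, raises the miss rate.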

Figure 6B (right side graphs) extends Figure 6A (left side graphs) by adding curves 
for smaller (8KBytes) and larger (32KBytes) caches. Similar miss rate and active
ratio curves resulted for smaller and larger caches albeit shifted with respect to the 
axes. As expected, smaller caches have higher miss rates. However, the miss rate of 
all caches converges to the same value as we decrease the decay interval. Smaller 
caches also have higher active ratios for a given decay interval. Active ratios 
converge toward 100% as we increase the decay interval. Figure 6 provides two rules 
for decay intervals: (i) smaller caches need a smaller decay interval to achieve the 
same active ratio and (ii) smaller caches need a smaller decay interval to yield the 
same relative increase in miss rate. 

We also examined the relation of the decay interval to line size. Figure 7 shows how 
the active ratio and the miss rate change as a function of line size (16, 32 and 64 
Bytes). Whereas miss rate can either increase or decrease depending on the spatial 
characteristics of the application, active ratio increases with larger line size. For a 
given decay interval, larger cache lines are less likely to be turned off.

Fig. 6. Miss rate and Active ratio as a function of decay interval for GCC, Compress and
Vortex. A: 16KB caches. B: Comparison of 32KB, 16KB, and 8KB caches (2-way, 16-byte
blocks).

Fig. 7. Miss rate and active ratio as a function of block size for GCC and Vortex with
16KBytes, 2-way caches.



4.2 Digital Cache-Line Decay in Data Caches 

As we described in Section 2, a realistic digital implementation employs two-bit 
counters per cache line and a global counter to generate the tick (T) signal at regular 
intervals. The global counter’s period, Tperiod, can be set externally. The decay interval
in this case is not exact but rather varies from two to three times Tperiod, with an average
value of 2.5 Tperiod. Comparing results of the variable decay interval with those
presented in the previous section (precise decay interval) we found very little
difference. The resolution of the two-bit counter is enough to approximate a precise
decay interval equal to 2.5 Tperiod.
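The spread of the effective decay interval can be worked out directly. The sketch assumes that a two-bit counter needs three ticks to saturate after a reset and that the last access falls at an arbitrary offset within a tick period; the function name is hypothetical.

```python
def decay_intervals(tick_period, ticks_to_decay=3):
    """Cycles from the last access until power-off, for each possible access offset.

    The offset is the number of cycles elapsed since the previous tick when the
    access occurs (0 .. tick_period - 1). The line is switched off on the
    ticks_to_decay-th tick after the access.
    """
    return [ticks_to_decay * tick_period - offset for offset in range(tick_period)]
```

For a tick period of 12.8K cycles the intervals span roughly 25.6K to 38.4K cycles, and their mean is about 32K cycles, i.e. two to three periods with an average of 2.5.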

Using two-bit counters we compare decay caches to standard caches by: first, keeping 
the miss rates equal, and second, maintaining the effective size of the decay cache 
equal to the size of the standard cache. 

4.2.1 Equal Miss Rate Comparisons 

Cache-line decay is a trade-off between dynamic and static power. By switching off 
cache lines we save leakage power but, on the other hand, we incur more misses 
which consume dynamic power. Dynamic power is also dissipated by the power- 
management circuits but this is negligible. Quantifying power consumption for a 
cache miss requires precise knowledge of implementation details: bus power 
consumption, timing, buffers, etc. We believe that results specific to an 
implementation cannot be generalized. 

We remove power consumption due to cache misses by comparing standard caches to 
decay caches of double size but of equal miss rate. We control miss rate in the decay 
caches by choosing an appropriate decay interval per application. 

We use curve fitting on the data presented in Figure 6 to approximate miss-rate 
curves. In this way we estimate a decay interval that will give us approximately a 
desired miss-rate. Figure 8 shows the results of this approach. The first graph 
compares a 32KByte decay cache to a 16KByte standard cache and the second graph 
compares a 16KByte decay cache to a 8KByte standard cache. We list the estimated 
decay intervals for every case in Table 2. The bar pairs show miss rates for the decay 
and standard caches, while the solid line shows the effective size of the decay cache 
(active ratio multiplied by actual size). The effective size of the 32KByte decay cache
is 27%, 50%, and 10% smaller than the 16KByte standard cache for gcc, vortex, and
compress respectively (for the 16KByte decay cache, 26%, 59%, and 5% smaller 
than the 8KByte standard cache). 

Table 2. Decay intervals for equal-miss-rate comparisons

16KByte Decay cache
32Kcycles, Tperiod = 12.8K

Fig. 8. Equal miss-rate comparisons: a 32KByte (16KByte) decay cache has smaller effective
size than a 16KByte (8KByte) standard cache.



By keeping the miss rates constant we reduce leakage power, but on the other hand we
have added circuitry that dissipates both dynamic and leakage power. The
increase in dynamic power is negligible compared to the dynamic power dissipation
of the entire chip; in particular, on average fewer than 110 transistors in the additional
decay circuitry switch every clock cycle (compared to the millions switching in the
processor core). We therefore focus on computing leakage power. Figure 9 shows the
relative change in the leakage power of the cache itself when the cache decay 
mechanism is used. Although the additional circuitry increases leakage to a small 
extent, the total leakage power of the cache is reduced significantly because large 
portions of the data cache get turned off. Since the cache is one of the main 
contributors to the total leakage power of the chip, cache decay results in large 
savings when leakage power becomes significant. 
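The transistor-count model of leakage used here can be sketched in a few lines. The example assumes six-transistor SRAM cells, a hypothetical tag width, always-powered decay circuitry of 30 transistors per line, and a 12-bit global counter costing 40N + 20 transistors; these parameters are illustrative assumptions, and the result is normalized to a cache of the same size without decay rather than to the half-size caches of Figure 9.

```python
def line_transistors(block_bytes, tag_bits, cell_transistors=6):
    """Transistors in the data and tag cells of one cache line (assumed 6T cells)."""
    return (block_bytes * 8 + tag_bits) * cell_transistors


def relative_leakage(num_lines, block_bytes, tag_bits, active_ratio,
                     counter_transistors_per_line=30, global_counter_bits=12):
    """Leakage of a decay cache relative to an equal-size cache without decay.

    Leakage is taken as proportional to the number of powered transistors:
    the powered fraction of the array plus the always-on decay circuitry.
    """
    cache = num_lines * line_transistors(block_bytes, tag_bits)
    decay = (num_lines * counter_transistors_per_line
             + 40 * global_counter_bits + 20)
    return (active_ratio * cache + decay) / cache
```

Under these assumptions a 16KByte cache with 16-byte lines pays about 3% extra leakage for the decay circuitry, so the net saving is positive whenever the active ratio falls below roughly 97%.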

4.2.2 Equal Size Comparisons 

An alternative way to examine decay caches is to keep their effective size the same as 
a standard cache half the size, i.e., keep the active ratio less than or equal to 50%. 
Again we estimate the decay interval that will give us such an active ratio from the 
graphs in Figure 6. Figure 10 shows that a decay cache of equal effective size as a 
standard cache results in lower miss rate (significant reduction for compress). This in 
turn translates into increased performance and decreased dynamic power attributable 
to misses.

Fig. 9. Static power dissipation of decay caches (32KBytes and 16KBytes) normalized to
standard caches half their size (16KBytes and 8KBytes) for gcc, compress and vortex. Decay
does not increase miss rate in this case.

Fig. 10. Equal area comparisons: a 32KByte (16KByte) decay cache of approximately
16KByte (8KByte) effective size has a smaller miss rate than a 16KByte (8KByte) standard
cache.



4.3 Effects on Performance 

Besides the above relative comparisons, we show the effects of cache-line decay on 
miss rate, IPC, and active size by comparing decay caches to standard caches of equal 
size. Here, we do not use L2 caches (L1 misses are serviced directly from
memory) for two reasons: i) we want to make evident the IPC impact of the decay
mechanisms (L2 caches tend to reduce it to insignificant levels), and ii) we
concentrate on small L1 caches to reflect the size of the benchmarks; cache-line
decay would work well in relatively large cache hierarchies.

Figure 11 shows the miss rate, IPC, and effective size (percent of actual size) for 
32KByte decay and standard caches. We chose a single decay interval of 128K cycles 
for all benchmarks (which may not be optimum for all benchmarks). This decay 
interval is larger than what Figure 6 would suggest for the miss rates and active ratios 
of Figure 11. This is because the lack of L2 expands the time between cache accesses. 
The increased average memory latency necessitates an increase in the decay interval. 
Increases in miss rate range from 8% (m88Ksim) to 293% (go) with an average of 
88%. However, the decrease in IPC is moderate and ranges from 6% (m88Ksim
and li) to 18% (go) with an average of 14%. The superscalar out-of-order core largely
hides the increase in average access latency. Decrease in leakage power for the cache 
memory array ranges from 57% (gcc) to 75% (go) with an average of 67%.

Fig. 11. Miss rate, IPC and Active ratio for 32K standard and 32K decay cache. There is no L2
cache. 128K cycles decay interval for all programs.



5 Conclusions 

In this paper we propose cache decay, a mechanism to reduce leakage power 
dissipation in caches. We turn off power to cache lines that have not been accessed 
within a decay interval. By controlling power at a cache-line granularity we can 
achieve a significant reduction in leakage power while at the same time preserve 
much of the performance of the cache. A decay cache can have an effective powered
size much smaller than a cache of equal miss-rate. Alternatively, a decay cache with
the effective powered size of a small cache performs better. In addition, the full
performance of the decay cache is available to demanding applications when power
consumption is not an issue. This flexibility of the decay cache is particularly useful
in battery-powered computers. Initial results also show that cache decay can be 
successfully applied to instruction caches. 

Our results show that different applications have different optimal decay intervals for 
a given miss-rate/active ratio target. Our proposed digital implementation can be 
controlled at run-time by the operating system via the global cycle counter. The OS 
can set the period of the global counter to produce the desired decay interval
according to the demands of the executing application and the power-consumption
requirements of the system. Profiling and/or run-time monitoring can be used to 
adjust decay intervals. We are also examining adaptive approaches, where the decay 
interval is adjusted individually for each cache line. By monitoring each cache line’s 
extraneous decay misses its decay interval can be adjusted to avoid repeating 
mistakes. Such methods show promise for good performance without the need to set a 
decay interval on a per-application basis. With the increasing importance of leakage 
power in upcoming generations of CPUs, and the increasing size of on-chip memory, 
cache decay can be a useful architectural tool to reduce leakage power consumption.

Abstract. Power dissipation is a major concern not only for portable
systems, but also for high-performance systems. In the past, energy
consumption and processor heating were reduced mainly by focusing efforts
on mechanical or circuit design techniques. Now that we are reaching 
the limits of some of these past techniques, an architectural approach 
is fundamental to solving power related problems. In this work, we use 
a model of the Alpha 21264 to simulate a high-performance, multi- 
pipelined processor with two integer pipeline clusters and one floating 
point pipeline. We propose a hardware mechanism to dynamically 
monitor processor performance and reconfigure the machine on-the-fly 
such that available resources are more closely matched to the program’s 
requirements. Namely, we propose to save energy in the processor by 
disabling one of the two integer pipelines and/or the floating point 
pipe at runtime for selective periods of time during the execution of a 
program. If these time periods are carefully selected, energy may be 
saved without negatively impacting overall processor performance. Our 
initial experiments show average total chip energy savings of 12%
and as high as 32% for some benchmarks while performance degrades 
by an average of only 2.5% and at most 4.5%. 

Keywords: architecture-level, high-performance, low-power 



1 Introduction 

State-of-the-art general purpose processors are designed with performance as the 
primary goal. This means that designers must tune their processors to achieve 
high performance on the greatest number of applications. Since applications may 
vary widely in their resource requirements, designers may choose to include cer- 
tain hardware structures on chip knowing that they will be useful to only a 
subset of the applications. For instance, the Alpha 21264 processor can execute 
out-of-order up to 6 instructions per cycle. However, this complex hardware fea- 
ture offers limited benefits over a simple, in-order structure, if the applications 
the processor is running contain limited instruction-level parallelism (ILP). Likewise,
a sophisticated 2-level branch predictor such as the one described in [9] may
greatly reduce branch misprediction rates for integer programs but may be unnecessary
for floating point applications that tend to contain highly predictable
branch behavior. While these architectural enhancements are essential in order 
to attain high performance across a range of applications, they are expensive in 
terms of energy consumption. 

Performance (as measured in terms of resource utilization) may vary not only 
for different applications but even during the execution of a single application. 
For instance, Wall showed that the amount of ILP within a single application 
varies by up to a factor of three [11]. Similar experiments were conducted by 
Sherwood and Calder [10] where variations in average instructions committed
per cycle were correlated with architectural features such as branch prediction,
value prediction, cache performance and reorder buffer occupancy. Figure 1 il- 
lustrates in more detail this phenomenon of varying performance by displaying 
the average number of instructions issued per cycle over time. Each data point 
represents the issue rate for a window of 1000 cycles. From a performance point 
of view, some of the underutilized portions of the processor can be completely
disabled during the “low-issue” windows without hampering performance. 
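The windowed issue rate plotted in Figure 1 can be computed as sketched below. The function names and the threshold used to label a window as "low-issue" are hypothetical; the text does not specify a threshold at this point, so it is left as a parameter.

```python
def windowed_issue_rate(issued_per_cycle, window=1000):
    """Average instructions issued per cycle over consecutive full windows."""
    rates = []
    for start in range(0, len(issued_per_cycle) - window + 1, window):
        rates.append(sum(issued_per_cycle[start:start + window]) / window)
    return rates


def low_issue_windows(rates, threshold):
    """Indices of windows whose issue rate falls below an assumed threshold."""
    return [i for i, rate in enumerate(rates) if rate < threshold]
```

A trailing partial window is dropped, so each reported rate always covers exactly `window` cycles.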




Fig. 1. Variation in IPC over time for the hydro2d benchmark. 



Previous work has tried to capitalize on this phenomenon of underutiliza- 
tion of resources, both within and across different applications. One study has 
proposed the use of Complexity-Adaptive Processors to reconfigure hardware to
match the diverse needs of a particular application [1]. A dynamic clock would 
allow each configuration to operate at its full potential. Alternatively, the work 
of [6] proposes that the software select a desired performance for an application 
based on workload traces. The issue mode of the processor (e.g. in-order or out- 
of-order) would then be adjusted in hardware to meet the targeted performance. 

In this paper, we propose a hardware mechanism to dynamically monitor pro- 
cessor performance and enable or disable selective parts of the CPU subsystem 
on-the-fly. This mechanism allows the processor’s available resources to more 
closely match the needs of the program (i.e., as determined by the program’s 
ILP). The processor then saves energy by not consuming power in hardware
structures that are disabled. We use a model of the Alpha 21264 to simulate
a high-performance, multi-pipelined processor with two integer pipeline clusters 
and one floating point pipeline. Within the context of our processor model, this 
translates into selectively disabling one of the two integer clusters and/or part 
or all of the floating point pipe cluster whenever the program cannot make good 
use of these extra resources. Likewise, these clusters may also be re-enabled when 
ILP increases to the point where the extra resources may be useful to improve 
performance. By properly monitoring when to enable/disable part of the CPU 
subsystem we save energy without impacting performance. 

The paper makes the following contributions: 

— We show that hardware monitoring of performance can be used to predict 
future short-term performance within an application. 

— We propose several techniques to estimate processor performance in order 
to determine when the processor should enter low-power mode. 

— We show that processor resources can be significantly reduced for selective 
time periods without hampering performance. 

— We show that on average 12% of the total energy of the processor may be 
reduced when pipelines are selectively disabled. 



2 Experimental Methodology 

The simulator used in this study is derived from the SimpleScalar [3] tool suite. 
SimpleScalar is an execution-driven simulator that uses binaries compiled to 
a MIPS-like target. SimpleScalar can accurately model a high-performance, 
dynamically-scheduled, multi-issue processor. We added modifications to Sim- 
pleScalar to incorporate the following enhancements: 

— Multi-pipelined issue and execution clusters (2 integer, 1 floating point). 

— More accurate fetch unit including a collapsing buffer. 

— Performance monitoring hardware to model our power management tech- 
niques. 

In addition, we used a modified version of the Wattch framework [2], interfaced 
with SimpleScalar, to estimate energy consumption at the architectural level. 
Details of these modifications follow. 

Our baseline simulation configuration is derived from the basic Alpha 21264 
pipeline and extended to model a future generation microarchitecture based on 
this same multi-pipelined configuration. Specifically, we retain from the Alpha 
21264 processor the concept of partitioning the issue and execution phases into 
three pipelines (2 for integer instructions, and 1 for floating point). Within the 
SimpleScalar model, this translates into partitioning the register update unit 
(RUU) into three distinct regions. The RUU is a combined instruction window, 
array of reservation stations, and reorder buffer. The processor can issue up to 
8 instructions per cycle. 

The simulator also includes a more aggressive fetch stage that aligns I-cache 
accesses to block boundaries and implements a variant of the collapsing buffer [4]. 

Fig. 2. Basic multi-pipelined processor. 



With the collapsing buffer, the fetch unit can deliver up to two (not necessarily 
contiguous) basic blocks from the I-cache per fetch cycle for a maximum of 8 
instructions total. 

Table 1 shows the complete configuration of the processor model. Note that 
the RUU and LSQ are divided into three parts corresponding to the two integer 
cluster pipelines (IC1 and IC2) and the floating point pipeline (FP). Also note 
that the ALU resources listed in the table may incur different latency and oc- 
cupancy values depending on the type of operation that is being performed by 
the unit. 



Table 1. Machine configuration and processor resources. 



Param.                  Units       Param.   Configuration
Fetch/Issue Width       8/12        IL1      64KB 4-way; 32B line; 1 cycle
Integer ALU             8           DL1      64KB 4-way; 32B line; 1 cycle
Integer Mult/Div        2           L2       256KB 4-way; 64B line; 6 cycles
FP ALU                  4           Mem.     128 bit-wide; 20 cycles on hit,
FP Mult/Div/Sqrt        2                    50 cycles on page miss
Memory Ports            2           B_Pred   4k + 4k + 4k
RUU IC1/IC2/FP          64/64/16    BTB      1K entry 4-way set assoc.
LSQ IC1/IC2/FP          32/32/8     RAS      32 entry queue
Fetch Queue             8           ITLB     64 entry fully assoc.
Min. Mispred. Penalty   6           DTLB     128 entry fully assoc.

A high-level view of the basic pipeline is shown in Figure 2. Notice that 
after the decode and register renaming stages, instructions are sent either to 
the integer issue queue or to the floating point queue. Before they enter the 
queue, integer instructions are statically assigned to either the upper or lower 
functional unit clusters. Arbitration among the instructions ready for execution 
occurs across all three clusters. The integer issue queue has two separate arbiters 
that dynamically issue the oldest queued instructions each cycle within the upper 
and lower clusters respectively (i.e. each cluster may issue up to 4 instructions 
per cycle). In addition, the floating point issue queue can issue a maximum of 
four instructions. 

Our simulations are executed on a subset of the SPECint95 and SPECfp95 
benchmarks [5]. They were compiled using a re-targeted version of the GNU gcc 
compiler with full optimization. This compiler generates SimpleScalar machine 
instructions. Since we are executing a full model on a very detailed simulator, 
the benchmarks take several hours to complete; due to time constraints we 
apply the detailed simulator model only to a sampling of the program execution. 
All benchmarks are fast-forwarded for 50 million instructions to avoid startup 
effects. The benchmarks are then executed for 100 million committed 
instructions, or until they complete, whichever comes first. All inputs come from 
the reference set and are shown in Table 2. 



Table 2. The benchmarks used in this study. 



Benchmarks   Input
apsi         ref
compress     ref
go           2stone9
gcc          varasm.i
li           ref
hydro2d      ref
perl         primes
vortex       persons.1k



3 Energy Considerations 

Figure 3 compares the energy consumption breakdown obtained using Wattch 
for a subset of SPEC95 benchmarks, assuming all pipelines are always enabled. 
We assumed linear clock gating for multi-ported hardware. In addition we as- 
sume a non-zero “turnoff factor” such that any enabled unit will still dissipate 
10% of its maximum power when it is not used during a particular cycle. No- 
tice that over 30% of the energy consumption is due to the clocks whereas the 
issue queue (window in the figure) and functional units (ALU) together make 






R. Maro, Y. Bai, and R.I. Bahar 



[Figure: stacked bar chart titled "Energy Distribution", vertical axis 0% to 100%; legend: clock, resultbus, alu, dcache2, dcache, icache, regfile, lsq, window, bpred, rename] 

Fig. 3. Distribution of energy for different benchmarks. 



up 45% of the total energy consumption on the chip. Therefore, by employing 
our processor reconfiguration technique to selectively disable clusters of issue 
and execution hardware, we are attempting to reduce energy consumption in 
sections of hardware that comprise a significant portion of the overall energy 
consumption. 



3.1 Reconfiguring the Processor for Energy Savings 

To save energy in the processor, we propose several low-power modes that the 
processor can enable or disable dynamically during program execution. Following 
is a description of the different modes we considered: 

low-power mode LF: One integer cluster is disabled and the remaining 
resources are fully operational. 

low-power mode LH: One integer cluster is disabled and the floating point 
pipe has half its functional units disabled. 

low-power mode LL: One integer cluster and the entire floating point cluster 
are powered off. 

low-power mode FH: The integer clusters are fully operational but the 
floating point pipe has been partially disabled. 

Henceforth we will refer to a generic low-power mode when one of these states is 
enabled. 

While in low-power mode, one or more of the pipeline clusters is partially or 
fully disabled and the clocks going to these clusters are also completely disabled. 
This mode reduces the number of issue queue entries and functional units effec- 
tively reducing the issue width as well. In order to retain coherency, writes (but 
not reads) continue to be made to both copies of the integer register file. This 
low-power configuration allows power to be saved by the functional units, the 
register file, and the instruction window’s issue queues and selection logic. More 
details on the amount of savings will be presented in Section 5. Once low-power 
mode is enabled, we must drain the disabled pipeline(s) before their associated 
issue queue entries, functional units, and clocks can be disabled. Thus, there 
is some cost involved in switching the machine to this mode. To minimize this 
cost, we require that the processor remain in any one mode for a pre-defined 
minimum number of cycles. 
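
The mode-switching rules above (drain before disabling, and a minimum residence time in each mode) can be sketched as a small controller. This is only an illustration under stated assumptions: the class `ModeController`, its field names, and the default `min_dwell` of 64 cycles are hypothetical and do not come from the simulator; only the mode names LF, LH, LL and FH are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    FULL = "full"  # all clusters enabled
    LF = "LF"      # one integer cluster off, FP fully on
    LH = "LH"      # one integer cluster off, half of the FP units off
    LL = "LL"      # one integer cluster off, FP cluster off
    FH = "FH"      # both integer clusters on, FP partially off


@dataclass
class ModeController:
    min_dwell: int = 64            # hypothetical minimum cycles per mode
    mode: Mode = Mode.FULL
    cycles_in_mode: int = 0
    pending: Optional[Mode] = None

    def request(self, target: Mode) -> None:
        """Ask for a mode change; accepted only after the minimum dwell."""
        if target is not self.mode and self.cycles_in_mode >= self.min_dwell:
            self.pending = target

    def tick(self, drained: bool) -> None:
        """Advance one cycle. `drained` is True when the affected pipeline
        holds no in-flight instructions (pass True when powering back on)."""
        if self.pending is not None and drained:
            self.mode, self.pending, self.cycles_in_mode = self.pending, None, 0
        else:
            self.cycles_in_mode += 1
```

A caller would invoke `request()` whenever a monitor fires and `tick()` once per simulated cycle; the switch takes effect only once the pipeline reports that it has drained.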

3.2 Estimating Power 

As stated before, we use the Wattch framework within SimpleScalar to esti- 
mate power [2]. Several modifications were made to both SimpleScalar and 
Wattch in order to accommodate a multi-pipelined processor that can reconfigure 
its resources dynamically. With the new modifications, the power contribution 
of each hardware structure varied according to 

1. the effective issue width of the processor, 

2. the total number of times the structure was accessed, 

3. the number of ports accessed in a particular cycle, and 

4. the particular low-power mode enabled. 

In particular, power dissipation in the issue selection logic, instruction window, 
load/store queue, integer functional units and global clocks is reduced when one 
of the clusters is disabled. We created a separate power model for each pipeline 
and one for the remaining circuitry. The power model for each pipeline included 
power estimates for the register file, the window, the LSQ and the result-bus 
structures, using access data from the SimpleScalar simulator to obtain these 
estimates. If one of the integer pipelines was disabled during low-power mode, 
we don’t update the window, the LSQ and result-bus power components but we 
still consider the power used by the integer register file unit in order to maintain 
coherency with the other integer register file. When the floating point pipeline 
is disabled all components, including the register file, are disabled since there is 
only one floating point register file. 
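
As a rough illustration of the per-cycle accounting described in Section 3 (linear clock gating for multi-ported structures, a 10% turnoff factor for enabled but idle units, and no power for disabled clusters), one might write the following. The function name and signature are hypothetical, and the model is a simplification rather than the modified Wattch code itself.

```python
TURNOFF_FACTOR = 0.10  # from the text: an enabled, unused unit draws 10% of maximum


def unit_power(max_power, ports_used, total_ports, enabled=True):
    """Per-cycle power of one structure (simplified, assumed model)."""
    if not enabled:
        return 0.0                          # cluster and its clocks are disabled
    if ports_used == 0:
        return TURNOFF_FACTOR * max_power   # idle but still enabled
    used = min(ports_used, total_ports)     # cannot exceed the physical ports
    return max_power * used / total_ports   # linear clock gating
```

Summing `unit_power` over all structures and cycles would give a total energy estimate in whatever units `max_power` is expressed in.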

4 Performance Monitoring 

As mentioned in the introduction, we want to monitor via hardware the perfor- 
mance of a particular benchmark actively running on the processor. We can then 
use this information to dynamically determine when to reconfigure the processor 
into low-power mode. In this mode, selective parts of the processor are disabled 
such that its active resources more closely match the program’s available ILP. In 
the following section we describe in more detail the various performance moni- 
toring techniques we implemented. 

4.1 Determining When to Enter Low-Power Mode 

Since the processor may perform differently depending on whether all its re- 
sources are enabled or not, we use different monitoring techniques to determine 
when to enable or disable low-power mode. Following are the hardware monitor- 
ing mechanisms we implemented to disable the integer cluster. 

Functional Unit Usage. A relatively inexpensive way to determine when the 
program is executing a critical section of code (i.e., where resources are being 
underutilized) is to monitor the effective integer functional unit usage over time 
by means of a simple shift register. When the percentage of busy functional units 
is under a certain threshold (e.g., less than 50% of all available functional units 
are utilized) a ’1’ is shifted into the register. At any given cycle, if the number of 
1’s present in this register is greater than some user-defined threshold, a critical 
section is detected. We make the assumption that if recent history indicates an 
underutilization of functional units then we can presume (at least for the near 
future) that in the following cycles few resources will be needed as well. 
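
A minimal software sketch of such a shift-register monitor follows. The class and method names are hypothetical; the default values (a 16-bit history, a 50% busy threshold, and 10 ones required) mirror the LP1 configuration reported in Section 5 and are otherwise arbitrary.

```python
class UsageHistory:
    """Shift register of per-cycle 'underutilized' flags (illustrative only)."""

    def __init__(self, length=16, busy_fraction=0.5, ones_needed=10):
        self.length = length
        self.busy_fraction = busy_fraction
        self.ones_needed = ones_needed
        self.bits = 0

    def shift_in(self, busy_units, total_units):
        """Shift in a 1 if fewer than busy_fraction of the units were busy."""
        low = busy_units < self.busy_fraction * total_units
        mask = (1 << self.length) - 1       # keep only the newest `length` bits
        self.bits = ((self.bits << 1) | int(low)) & mask

    def low_activity(self):
        """True when enough recent cycles were underutilized."""
        return bin(self.bits).count("1") >= self.ones_needed
```

In hardware the population count would be a small adder tree or an up/down counter updated as bits enter and leave the register; the Python version only shows the decision rule.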

Figure 4 displays the effective usage of the CPU functional units during the 
execution of a single program. The black line tracks the number of integer ALU 
functional units (1-8) used each cycle while the grey line tracks the number 
of memory port functional units used (1-2). The vertical lines show when the 
commands are given to enable/disable low-power mode. For example, the third 
and fourth vertical lines at cycles 230 and 360 represent the times when commands 
are given to disable and then enable low-power mode respectively. The bottom 
grey horizontal line represents the effective low-power time period. There is a 
short delay between the time the power-off command is given and when the 
low-power mode is enabled in order to account for draining the affected 
cluster. Notice that once a critical section is detected, several cycles pass before 
multiple resources are needed again. 

Monitoring IPC. A second way to estimate critical sections is to compute the 
number of committed or issued instructions during a defined number of cycles 
(i.e., windows). If the measured IPC within this time window is below a certain 
threshold we put the machine in low-power mode. This scheme requires a re- 
setable counter that is cleared at the beginning of each window and incremented 
appropriately each cycle. The counter value is then compared with a threshold 
value at the end of the window. We must choose a time window that is not too 
small such that very short bursts of high/low activity are not interpreted by the 
monitoring hardware as a sign that the program’s general resource requirements 
are changing. On the other hand, if the window is made too large, we may not 
be able to react quickly enough to general changes in resource utilization which 
Fig. 4. Functional unit usage: vertical axis shows the number of functional units busy; 
horizontal axis is the cycle offset 



may result in performance degradation or missed opportunities in energy sav- 
ings. Given these constraints, from our initial experiments, we found that 512 
cycles was a reasonable window size. This mechanism can also be combined with 
the shift register described above such that a low IPC must be present for a 
certain number of windows before a critical section is detected; in general, 
however, the window size should be made smaller in this case. 
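
The windowed IPC check can be sketched as a resettable counter. The class name `IpcWindowMonitor` is hypothetical; the 512-cycle window comes from the text, and the default threshold of 2.0 is borrowed from the LP2 configuration in Section 5 purely for illustration.

```python
class IpcWindowMonitor:
    """Counts committed (or issued) instructions over fixed-size windows."""

    def __init__(self, window=512, ipc_threshold=2.0):
        self.window = window
        self.ipc_threshold = ipc_threshold
        self.count = 0   # instructions seen in the current window
        self.cycle = 0   # cycles elapsed in the current window

    def tick(self, instructions):
        """Call once per cycle. Returns None mid-window; at the end of a
        window returns True if the measured IPC was below the threshold."""
        self.count += instructions
        self.cycle += 1
        if self.cycle < self.window:
            return None
        low = self.count < self.ipc_threshold * self.window
        self.count = 0   # clear the counter for the next window
        self.cycle = 0
        return low
```

Feeding the end-of-window result into a short history register gives the combined scheme mentioned above, in which a low IPC must persist over several windows before a mode change is requested.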



Detecting Variations in IPC. A third way is to compute the difference be- 
tween the committed and the issued instructions. If, on average, there are many 
more instructions being issued than committed, this may indicate that a large 
number of mispredicted instructions are being issued. By entering low-power 
mode when such a situation is detected, we restrict the issue rate and indirectly 
the number of mispredicted instructions allowed to issue in the processor. In this 
way, we effectively employ a type of pipeline gating similar to what was done in [8] 
as a means of saving processor energy. 



Dependency Counting. A major limitation of increasing ILP is the presence 
of true data dependencies. Therefore, if we have many dependencies among the 
issued instructions, ILP is limited and there is a greater chance that proces- 
sor resources are underutilized. We propose two ways to estimate program ILP 
through dependency counting. The first one requires one counter to sum the to- 
tal number of input dependencies associated with each entry in the instruction 
window (i.e., RUU). Alternatively, we can include a counter for each RUU entry 
and increment the counter based on the number of input dependencies (directly 
and indirectly). If the total dependency count is higher than a given threshold, 
we assume limited ILP and allow the processor to enable low-power mode. 
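
The first dependency-counting scheme (a single counter summed over the window) might look like the sketch below. The window representation, a list of `(dest_reg, src_regs)` tuples in program order, and both function names are assumptions made for illustration; the sketch also assumes registers are already renamed, so each destination name is unique among in-flight entries.

```python
def count_input_dependencies(window):
    """Total number of source operands produced by an older in-flight entry.

    `window` is a list of (dest_reg, src_regs) tuples, oldest first.
    """
    in_flight = set()   # destinations of older entries still in the window
    total = 0
    for dest, srcs in window:
        total += sum(1 for s in srcs if s in in_flight)
        if dest is not None:
            in_flight.add(dest)
    return total


def ilp_is_limited(window, threshold):
    """True when the dependency count exceeds the chosen threshold."""
    return count_input_dependencies(window) > threshold
```

This counts direct dependencies only; the per-entry variant that also tracks indirect dependencies would need to propagate counts along the dependency chain.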



Floating Point Utilization. As mentioned in Section 3, we proposed different 
low-power modes that allow us the flexibility to completely or partially disable 
the floating point pipeline cluster. We determine when to enable each of these 
modes as follows: 
Half-power mode: When one of the integer clusters is disabled using one of 
the techniques mentioned above, we assume that in general ILP is limited 
and it is probable that not only integer resources are underutilized but also 
floating point resources. Therefore, when we signal to the processor to disable 
one of the integer clusters, we also signal to the floating point pipeline to 
disable half its resources and enter half-power mode. If the floating point 
cluster currently has valid instructions in its pipeline, these instructions are 
first allowed to complete before enabling this mode. 

Disable mode: We monitor the frequency of floating point instructions. If af- 
ter a certain number of cycles, no floating point instructions are fetched, we 
completely disable the floating point cluster. This decision is made indepen- 
dent of the current state of the integer clusters thereby allowing the floating 
point pipeline to be disabled for very long periods of time, particularly for 
integer-intensive programs. 
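
The disable-mode rule can be illustrated with an idle counter. The class name is hypothetical, the default limit of 3 cycles is taken from the LP1 configuration in Section 5, and the sketch ignores the drain delay; re-enabling on the next fetched floating point instruction follows the powering-on rule described in Section 4.2.

```python
class FpIdleMonitor:
    """Disables the FP cluster after a run of cycles with no FP fetches."""

    def __init__(self, idle_limit=3):
        self.idle_limit = idle_limit
        self.idle_cycles = 0
        self.fp_enabled = True

    def tick(self, fp_fetched):
        """Call once per cycle with the number of FP instructions fetched.
        Returns whether the FP cluster should be enabled."""
        if fp_fetched > 0:
            self.idle_cycles = 0
            self.fp_enabled = True          # wake up as soon as FP work arrives
        else:
            self.idle_cycles += 1
            if self.idle_cycles >= self.idle_limit:
                self.fp_enabled = False
        return self.fp_enabled
```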



4.2 Powering on Techniques 

We can use most of the same ideas described in the previous section to disable 
as well as enable low-power mode; however, threshold values may need to be 
adjusted differently. In addition, we also implemented the following mechanisms 
for determining when to return to full-power mode. 



Issue Attempts. Once we are in low-power mode, we need to react to local 
changes in performance that indicate our overall performance may suffer if we do 
not return to full-power mode. One way we determine this is to count the total 
number of issue attempts for each ready instruction before a functional unit is 
made available for its execution. To implement this scheme, a counter is added 
to each RUU entry; whenever a ready instruction is prevented from issuing due 
to lack of resources, its counter is incremented. If the total count for all the valid 
RUU entries reaches a certain threshold, this indicates that the processor should 
return to full-power mode. 
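
A sketch of the issue-attempt counters follows, assuming a simple list of RUU entries. The `RuuEntry` fields and both function names are hypothetical and are not SimpleScalar's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class RuuEntry:
    valid: bool = True
    ready: bool = False
    failed_attempts: int = 0   # times a ready instruction could not issue


def note_blocked(entries, blocked_indices):
    """Increment the counter of each ready entry that was denied a unit."""
    for i in blocked_indices:
        entry = entries[i]
        if entry.valid and entry.ready:
            entry.failed_attempts += 1


def should_power_on(entries, threshold):
    """True when the summed count over valid entries reaches the threshold."""
    total = sum(e.failed_attempts for e in entries if e.valid)
    return total >= threshold
```

Each cycle the issue logic would call `note_blocked` with the entries that lost arbitration, then test `should_power_on` to decide whether to request full-power mode.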



Variations in IPC. If the processor is committing more instructions than what 
it is able to issue, it can be advantageous to increase the issue width to sustain 
the commit rate. An alternative way to monitor when to return to full-power 
mode is to compute the delta between the issued and committed instructions 
during a certain time window. As with the powering off techniques described in 
Section 4.1, we can combine this technique with a shift register to record recent 
history such that the processor will not switch modes unless a large delta is 
detected over several windows. 



Trigger Events. In addition to the previous techniques we can decide to enable 
or disable one of the integer clusters according to the occurrence of defined events 
(e.g., data/instruction cache misses, ITLB misses, mispredictions). For 
example, the occurrence of an ITLB miss usually causes a long period 
of low activity followed by high functional unit usage for several cycles. It is 
important to note, however, that the latency caused by such events may easily 
be hidden due to high parallelism in the code that may enable a high rate of 
execution while these events are being serviced. 



Floating Point Activity. Once a floating point instruction is fetched and 
decoded, a signal to enable the full floating point pipeline is sent so that the 
resources will be available when the instruction is ready to issue. The processor 
may also fully enable the floating point pipeline if it is currently in low-power 
mode LH (1 integer cluster is disabled and the FP is partially disabled) and 
one of the powering on techniques mentioned above triggers the disabled integer 
cluster to power back on. 

5 Experimental Results 

We experimented with a combination of policies for enabling and disabling low- 
power mode in the processor. In this paper we report results using three different 
low-power configurations; these configurations represent typical results. As a ref- 
erence, we also included results obtained when one integer cluster was always 
disabled (i.e. the processor operated at half the issue width with half the re- 
sources at all times). The three low-power techniques were configured as follows: 

LP1: Monitor functional unit usage. Disable one of the integer clusters and half 
the floating point pipeline if fewer than half the functional units were used at 
least 10 out of 16 times (i.e., using a 16-bit history shift register). Reactivate 
the pipelines when the remaining functional units are used at 86% capacity 
for 3 out of the past 5 cycles. In addition, if no floating point instructions 
were detected for 3 cycles in a row, the floating point pipeline is completely 
disabled. Enable the full floating point pipeline (from either half or fully 
disabled mode) if any new floating point instructions are fetched. 

LP2: Disable the integer cluster and half the floating point cluster when the 
commit IPC < 2.0. Reactivate when commit IPC > 1.0. The floating point 
pipeline is completely disabled/enabled using the same method as in the LP1 
scheme. 

LP3: Disable the integer cluster and half the floating point cluster when at least 45% of 
the entries have input dependencies with other values in the RUU. Reactivate 
when the functional units are used at 85% capacity for 3 out of the past 5 
cycles. The floating point pipeline is completely disabled/enabled using the 
same method as in the LP1 scheme. 

Performance is reported in terms of IPC normalized to the base case (i.e., 
where full-power mode is always enabled). Power values are obtained using our 
modified version of Wattch and are reported in terms of total power normalized 
to the base case. When the performance monitors determine that one of the 






Change in Performance 




Fig. 5. Variation in performance for different benchmarks 



Change in Energy Consumption 




Fig. 6. Distribution of power for different benchmarks. 

integer clusters may be disabled, the integer pipelines are assumed to operate 
at full-power mode until the issue queue with the fewest entries has been fully 
drained at which point it is safe to completely disable that cluster’s issue queue 
and functional units. When the floating point pipeline is to be disabled, we first 
wait until the floating point issue queue is fully drained, at which point the entire 
floating point pipeline (including its issue queue, execution units, and register 
file) is disabled. 



Resource Savings 



Fig. 7. Percentage of time each pipeline cluster was disabled using the LP1 scheme 



In Figures 5 and 6 we show the relative performance drop and energy savings, 
respectively, using the three different low-power configurations LP1 to LP3 as well 
as the “half-mode” configuration where one of the integer clusters is always disabled. 
There are a number of findings we see from these results. First, if one of the 
integer clusters is always disabled, performance drops significantly (e.g. apsi, 
hydro2d, li, perl and su2cor all lose around 20% or more in performance). These 
results justify the need for including the extra resources in the processor. In 
contrast, by selectively disabling the integer and floating point clusters using 
any of our low-power techniques, performance is retained to within 4.5% of 
the base case and on average performance loss is less than 2.5%. Although it 
may dissipate less power on a given cycle, the half-mode configuration requires 
more cycles to complete execution, so overall energy consumption is often not 
significantly reduced and for benchmarks apsi, hydro2d, perl and su2cor energy 
consumption actually increases. In comparison, we can save up to 32% in energy 
consumption and on average 12% using our low-power techniques. 
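
The reason the half-mode configuration can increase energy follows from energy being power multiplied by time. For a fixed instruction count, execution time scales as 1/IPC, so relative energy is relative power divided by relative IPC. The sketch below works one case; the 20% IPC loss is the figure quoted above, while the 15% and 10% power reductions are hypothetical values chosen only to illustrate the arithmetic.

```python
def relative_energy(relative_power, relative_ipc):
    """Energy relative to the baseline for a fixed instruction count.

    time ~ 1 / IPC and energy = power * time, so the ratio is power / IPC.
    """
    return relative_power / relative_ipc


# Half-mode: 20% IPC loss (from the text), 15% lower power (hypothetical).
half_mode = relative_energy(0.85, 0.80)

# Selective disabling: 2.5% IPC loss (from the text), 10% lower power (hypothetical).
selective = relative_energy(0.90, 0.975)
```

Under these assumed numbers `half_mode` exceeds 1.0 (more energy than the baseline) while `selective` stays below 1.0, which is the qualitative behavior reported for Figures 5 and 6.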

To understand better where the energy is being saved, we looked more closely 
at the processor behavior using the LP1 scheme. In Figure 7 we graphed the 
percentage of time one of the integer pipelines was disabled as well as the percentage 
of time the floating point pipeline was either partially or completely disabled. 
From the graph, we see that one of the integer clusters was disabled on 
average 22% of the time and at most 54% for compress. The floating point cluster 
can operate with only half its resources usually less than 10% of the time, but 
the real savings comes from being able to completely disable the floating point 
pipeline for four of the integer benchmarks (go, li, m88ksim, perl). These four 
benchmarks are the same ones that show the greatest energy savings. 

While it is encouraging to see such high energy savings for some benchmarks, 
it is disappointing that other benchmarks, particularly compress and wave5, 
did not show larger savings, especially considering how often the processor was 
able to disable one of the integer clusters (54% and 33% respectively). On the 
other hand, these energy results depend greatly on assumptions made by the 
power estimation tool. For instance, in Wattch the power dissipation due to 
reading and writing the instruction window was much larger than the power 
dissipation of the instruction selection logic. Using our low-power techniques, we 
are not reducing the overall number of instructions issued in the processor so in 
general the power dissipation due to reading and writing these instructions in 
the window should be the same. On the other hand, since we are only selecting 
at most 4 instead of 8 integer instructions to execute when one of the integer 
clusters is disabled, the instruction selection power should be greatly reduced. 
However, if the selection logic is a small contributor to the total power, it will 
not make a significant impact on the overall energy consumption. Under different 
assumptions the impact of the selection logic may be much greater. Similarly, 
in Wattch clock power is largely dependent on the effective issue width of the 
processor. If the floating point pipeline is always disabled, this translates to 
significant energy savings as was seen for some of the benchmarks. 

6 Conclusion 

In this paper we showed that programs generally underutilize the processor ex- 
ecution resources available to them. Simply eliminating these resources from 
the chip severely hampers performance; however, we showed that selectively 
disabling some resources during low activity periods saves energy without ham- 
pering performance. 

We have tested out several different techniques for enabling low-power mode. 
We would like to analyze these techniques in more detail to better understand 
the impact of such variables as window size, minimum time spent in each mode, 
and history length on performance and energy savings. 

Several modifications were made to SimpleScalar and Wattch in order to 
better model the multi-pipelined aspects of our processor. For future work, we 
should like to tune the hardware structures in Wattch to more accurately model 
power dissipation of our specific architecture. 

Finally, while this work has focused on saving energy by taking advantage of 
low-activity periods in the microprocessor, similar techniques may be employed 
as a means of controlling peak power. Peak power occurs when the processor 
sustains high throughput for long periods of time. It may be necessary to limit 
throughput during these times in order to prevent reliability problems. One way 
to achieve this is by disabling sections of the processor pipeline using similar 
monitoring techniques as we described in this work. 

Abstract. We present TEM²P²EST, a flexible, cycle-accurate micro- 
architectural power/performance analysis tool based on SimpleScalar. The goal 
was to build a “flexible” simulation tool, incorporating several estimation 
models and providing a scalable framework for future development. This 
approach is based on the fact that different power models have different 
tradeoffs in terms of power estimation accuracy and flexibility/scalability. The 
simulator generates power estimates based on either empirical data or analytical 
models. In the future, other modes such as estimation based on RTL extraction can be 
included. The tool includes analytical models for dynamic and leakage power, 
di/dt power, dual Vt support and process technology scaling options. It has a 
built-in thermal model to study thermal issues and techniques like clock throttling. 
Initial studies show that our results are consistent and match well with a real 
design simulated with SPICE. In addition, we validated our temperature model 
with measurements on a typical microprocessor heat solution. 



1 Introduction 

The last decade has seen a tremendous increase in microprocessor complexity, with 
designers trying to squeeze every last bit of improvement in performance. This has 
led to an inefficient use of transistors leading to high power dissipations [1]. Power 
dissipation has become a significant issue in modern microprocessor design. In fact, it 
has become one of the primary design constraints along with clock frequency and die 
size [2]. In the mobile processor segment, increased power dissipation leads to 
decreased battery life and hence can jeopardize the marketability of the product. In 
case of high performance microprocessors, high power dissipation leads to thermal 
issues like device degradation and reduced chip lifetime. To prevent overheating, 
expensive heat sinks and packages are used, which add to the cost of manufacturing. 
Present day microprocessors are already approaching ~100 W total dissipated power 
[3]. 

Power dissipation optimizations have mainly targeted dynamic or switching power as 
it represents around 90% of the total dissipated power. However, people have now 
started looking at leakage power, which is becoming increasingly important with 
every process technology generation [4]. There has been some work on short-circuit 
power dissipation, but this forms a negligible fraction in well-designed circuits [5]. 
Traditionally, power dissipation issues have been tackled at the process/circuit level. 
However, over the last few years, it has become quite evident that “power aware” 
micro-architecture definition can go a long way in reducing the power dissipation. 

In this paper, we present TEM²P²EST, a cycle-accurate, flexible and scalable tool for 
power/performance analysis, based on SimpleScalar [6]. It supports simulation in the 
empirical mode (using real design data) and the analytical mode (using analytical 
models). It computes dynamic and static power, di/dt power, thermal statistics and has 
several features, which are described in subsequent sections. The simulator can be 
used by micro-architects and compiler designers to study tradeoffs, and come up with 
power-efficient architectures and compiling techniques. 

The next section looks at some prior work and motivation behind developing this 
simulator. Section 3 describes the structure of the simulator. Section 4 describes the 
power estimation modes, and Section 5 describes some preliminary validation results 
and avenues of future development. Finally, Section 6 concludes the paper. 



2 Motivation 

Micro-architectural power estimation has been a hot field of research for the last few 
years. Several micro-architectural power estimation methodologies have been studied. 
These can be broadly classified into empirical methods and analytical methods [7] 
and into fixed-activity and activity-sensitive methods [8]. Until recently, researchers 
have predominantly focused on caches, due to their relatively large on-chip area and 
their regular structure, which makes them easier to model [9], [10]. However, with the 
significantly higher power dissipation in the data-path of modern out of order 
superscalar processors [11], it makes sense to look at the whole chip rather than just 
caches. The earliest effort in this direction was ESP [12], a power simulator based on 
a simple five stage RISC pipeline. In the last couple of years, at least three significant 
full chip micro-architectural power estimation tools have been unveiled [13], [14], 
[15], both based on SimpleScalar. Our work is an extension of the Cai-Lim simulator [13]. 

The Cai-Lim model and the Wattch [14] use different estimation methodologies. The 
former uses a very detailed empirical model using power density data from real 
design, while the latter uses analytical models. Though the Cai-Lim model is more 
detailed, it is difficult to scale to other technologies and designs. TEMPEST tries to 
bridge this gap by providing an analytical mode in addition to the empirical mode. 
Another goal was to have a highly flexible and scalable infrastructure which would be 
amenable to future development. In its present stage of development, TEM^P^EST 
provides two different modes of simulation, which compute dynamic power, leakage 
power and di/dt power. They have support for dual V_t technology and technology 
scaling. The analytical models have several features like choice of dynamic and static 
decoders, dual and single rail sensing and Miller capacitance corrections. We have a 
temperature model built into the simulator, which to our knowledge is done by no 
other simulator. This model can be used to study thermal distribution and mechanisms 
like clock throttling. The power estimation structure has been modularized and 
decoupled from the performance simulator, which makes it easy to port the power 
estimator to another cycle-accurate performance simulator with minimal effort. 

TEM^P^EST can be used to study power optimization techniques at architectural as 
well as compiler level as described in [14]. It can be used to study clock gating, clock 
throttling strategies, di/dt power, and thermal gradients, which are crucial issues in 
today’s processor design cycles. 



3 Structure 

The structure of the simulator has been geared towards modularity and ease of 
extension. All the functions used in the power simulator are neatly organized into 
different modules. All the configuration and technology data is read from two files 
and added to a power database. The power database is used to provide information to 
power models. Some of the key files in the simulator are: 

tech.[c,h]: This module contains functions for calculating scaling factors and 
assigning proper values to the various technology dependent variables like /, etc. 
The routines read data from the technology definition file. The header file contains all 
the device dimensions used in the basic analytical models (0.8 µm) [16]. 

anal.[c,h]: This module contains all the basic analytical models, such as decoder, 
word-line, etc. [16]. 

power.[c,h]: This is the power simulator core. It contains routines for processing the 
configuration file, block level analytical models (e.g. Cache) and temperature models 
and the power and temperature update routines. Fig. 1 shows the overall file structure 
schematically. 




Fig. 1. File structure of TEM^P^EST. 

The processor is defined in terms of several Functional Unit Blocks (FUBs). Each 
FUB represents part or all of an architectural block. For example, itlb 
represents the entire instruction TLB, whereas il1_tag represents only the tags for the 
L1 cache. Presently, we have 32 FUBs that include most of the blocks that form a 
modern out-of-order processor. New FUBs can be added for finer granularity without 
making any major changes. 

3.1 Functional Unit Block (FUB) Design 

The design of the simulator is FUB-centric (i.e. all the data related to a FUB is stored 
in one data structure). This includes the power constants under different power 
conservation strategies, power thresholds, power statistics and junction temperature. 
An example of the structure used is shown below. 

    typedef struct { 
        char name[32]; 
        double active_power;        //without clock gating 
        double active_power_rd; 
        double active_power_wr; 
        double static_power; 
        double inactive_power; 

        double active_power_cg;     //with clock gating 
        double active_power_wr_cg; 
        double active_power_rd_cg; 
        double inactive_power_cg; 
        double maxpowerth;          //threshold power 
        double maxdidtth; 
        double cum_power;           //power stats. 
        double prev_power; 
        double max_power;           //maximum power 
        double max_didt; 
        int max_powerx; 
        int max_didtx; 
        double tj;                  //junction temperature 
    } fub_t; 



A typical use of the power constants is shown below. 

    if (!clk_gate) { 
        ialu->cum_power += ialu->active_power*act 
                         + ialu->inactive_power*(1-act); 
    } else { 
        ialu->cum_power += ialu->active_power_cg*act 
                         + ialu->inactive_power_cg*(1-act); 
    } 

The code above first checks whether clock gating is active. If it is, it uses the 
clock-gated power constants; otherwise it uses the basic power constants. One can view a 
scenario in which several power saving strategies are used. The maximum power 
threshold and the di/dt power threshold for the FUB can be specified in the 
configuration file. This can be used to check the number of threshold violations per 
unit time and trigger some power saving feature if exceeded. 
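
As a minimal sketch of how the per-FUB thresholds might be checked each cycle, the fragment below counts threshold violations for a single FUB. It uses a trimmed-down record whose field names follow the structure above; the two violation counters, the function name fub_update, and the treatment of di/dt power as the cycle-to-cycle change in power are assumptions made for illustration, not details taken from the simulator.

```c
/* Hypothetical, trimmed-down FUB record: only the fields this sketch needs.
 * The two violation counters are an assumption added for illustration. */
typedef struct {
    double active_power;     /* power constant at 100% activity */
    double inactive_power;   /* power constant at 0% activity */
    double maxpowerth;       /* maximum power threshold */
    double maxdidtth;        /* di/dt power threshold */
    double cum_power;        /* accumulated power */
    double prev_power;       /* power dissipated in the previous cycle */
    long   power_violations; /* assumed counter */
    long   didt_violations;  /* assumed counter */
} fub_sketch_t;

/* Update one FUB for one cycle; act is the activity in [0, 1].
 * Returns the power dissipated in this cycle. */
double fub_update(fub_sketch_t *f, double act)
{
    double p = f->active_power * act + f->inactive_power * (1.0 - act);

    /* Assumption: di/dt power is the magnitude of the cycle-to-cycle change. */
    double didt = p - f->prev_power;
    if (didt < 0.0)
        didt = -didt;

    if (p > f->maxpowerth)
        f->power_violations++;
    if (didt > f->maxdidtth)
        f->didt_violations++;

    f->cum_power += p;
    f->prev_power = p;
    return p;
}
```

A caller could compare the two counters against a budget per unit time and enable a power saving feature, such as the clock-gated constants, once the budget is exceeded.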

3.2 Activity Counts 

The activity counts are collected using an array of counters. Presently, the simulator 
has 69 activity counters. The number is significantly larger than the performance 
simulation activity counts in order to get more details. For example, the cache 
accesses have been further classified as writebacks, replacements, and invalidations. 
Extra activity counts can be added very easily just by adding a #define and inserting 
the activity counter at the appropriate place in the performance simulator. 
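
The counter mechanism described above can be sketched as follows. The counter names, the ACT_INC macro and the act_take helper are hypothetical and only illustrate the "#define plus counter array" pattern; the real simulator has 69 counters whose names are not given here.

```c
/* Hypothetical counter indices; adding a counter means adding one #define
 * and bumping ACT_NUM_COUNTERS. */
#define ACT_DL1_READ        0
#define ACT_DL1_WRITE       1
#define ACT_DL1_WRITEBACK   2
#define ACT_DL1_REPLACE     3
#define ACT_DL1_INVALIDATE  4
#define ACT_NUM_COUNTERS    5

static unsigned long act_count[ACT_NUM_COUNTERS];

/* Inserted in the performance simulator wherever the event occurs. */
#define ACT_INC(id) (act_count[(id)]++)

/* Read and clear one counter, e.g. once per cycle from the power routine. */
unsigned long act_take(int id)
{
    unsigned long n = act_count[id];
    act_count[id] = 0;
    return n;
}
```

Clearing the counter on read keeps the per-cycle activity separate from cumulative statistics, which is one possible design choice among several.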



3.3 Power Computation 

The power computation routine is the only routine that is called every cycle during 
the performance simulation. The routine contains computation functions for each 
FUB. Since all the power constants are pre-calculated, the addition of this routine is 
only a small overhead for the simulation time. Nevertheless, since the temperature 
computation is cycle-to-cycle dependent, it is difficult to get away with pre- 
calculation altogether. Creating a lookup table instead of computation is a code size 
versus computation time tradeoff issue. 



3.4 Configuration Files 

The following files are used in addition to the architectural configuration file. 



FUB Configuration File: This file contains the power density and area data for the 
FUBs to be simulated in the empirical mode. It also contains the power thresholds for 
each FUB as well as the full chip power thresholds. The mode of simulation for every 
FUB can be defined in this file. In addition, the configuration file contains physical 
structure data for the FUBs. These include bit-line partitions, word-line partitions, set 
partitions, type of decoder (dynamic vs. static), and read mode (dual rail vs. single 
ended). The file also contains the thermal parameters, such as ambient temperature, 
maximum power dissipation supported by the package, thermal resistance, thermal 
capacitance, and heating/cooling time constants. 

Technology Configuration File: This file contains the process technology specific 
data like the effective channel length, supply voltage, operating frequency, high and 
low V_t values and the corresponding leakage currents. 



4 Modeling Methodology 

The power is calculated using the simple formula: power = (power density)*(area)*(activity). 
The power dissipation is computed for each FUB 
separately, and then added to get the full chip power dissipation. Fig. 2 illustrates the 
control flow of the simulator. The power constant calculation is presently performed 
in one of two modes: empirical mode, which uses real power data and analytical 
mode, which uses parameterized analytical models. Other modes, for example, RTL 
extraction based simulation, can be added easily. Fig. 3 shows a schematic of the 
power constant generation. 




Fig. 2. Power simulator control flow. 




Fig. 3. Power constant generation process. 

4.1 Empirical Mode 

The prime reasons for supporting the empirical mode are that many synthesizable 
logic blocks cannot be modeled using analytical models. Also, it is difficult to build a 
detailed clock distribution and interconnection power model, because there are many 
transistor sizing and layout issues that cannot be easily addressed in analytical 
models. 

In the empirical mode, the simulator takes empirical power density and area data as 
inputs to calculate the power constants. This mode of simulation can be used for 
generating relatively accurate power numbers. The input data provided are the active 
and inactive power density, and the area of each FUB. In order to get more detail, we 
divide these numbers into five parts depending on the type of circuits used. This can 
be used to study the power distribution amongst different types of circuits used in the 
processor. Presently, the types of circuits considered are dynamic logic, static logic, 
clock, memory type regular array and PLA circuits. The activities are provided every 
cycle by the performance simulator. The power calculation is implemented as follows: 

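$\text{Active power constant} = \sum_{\text{circuits}} (\text{active power density}) \times (\text{area})$, 

$\text{Inactive power constant} = \sum_{\text{circuits}} (\text{inactive power density}) \times (\text{area})$, 

$\text{Power} = (\text{active power constant}) \times (\text{activity}) + (\text{inactive power constant}) \times (1 - \text{activity})$. 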

This mode can be especially useful in industry where real data is available. However, 
there are a couple of shortcomings to this method. In order to get this data, each FUB 
has to be modeled and simulated, which is not an easy task. It is difficult to perform 
architectural feasibility/scalability studies. Changing the FUB configuration implies 
re-computing or at least re-estimating the data, which is time consuming. In order to 
get around this, we have included the analytical mode. These two simulation modes 
can be used simultaneously i.e., one of the FUBs can use the empirical mode while 
another can use analytical mode and so on. 
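
The empirical-mode calculation can be sketched in a few lines. The structure layout, function names and units below are assumptions for illustration; only the sum over the five circuit types and the activity-weighted combination come from the text.

```c
#define N_CIRCUIT_TYPES 5  /* dynamic, static, clock, memory array, PLA */

/* Hypothetical per-FUB empirical input data (units assumed: W/mm^2 and mm^2). */
typedef struct {
    double active_density[N_CIRCUIT_TYPES];
    double inactive_density[N_CIRCUIT_TYPES];
    double area[N_CIRCUIT_TYPES];
} fub_empirical_t;

/* Sum of (active power density) * (area) over the circuit types. */
double active_power_constant(const fub_empirical_t *e)
{
    double sum = 0.0;
    for (int i = 0; i < N_CIRCUIT_TYPES; i++)
        sum += e->active_density[i] * e->area[i];
    return sum;
}

/* Sum of (inactive power density) * (area) over the circuit types. */
double inactive_power_constant(const fub_empirical_t *e)
{
    double sum = 0.0;
    for (int i = 0; i < N_CIRCUIT_TYPES; i++)
        sum += e->inactive_density[i] * e->area[i];
    return sum;
}

/* Per-cycle power for an activity in [0, 1]. */
double fub_power(double active_c, double inactive_c, double activity)
{
    return active_c * activity + inactive_c * (1.0 - activity);
}
```

Because the two constants depend only on the configuration data, they can be computed once at start-up, leaving only the final multiply-add for the per-cycle routine.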

4.2 Analytical Mode 

The analytical models for FUBs with regular design can be easily and accurately 
generated. Time-delay model extensions are also possible [16]. In the analytical 
mode, power constants are generated using models provided. Presently, we have the 
capability to model most of the regular and simple logic based structures. The models 
are based on the models used by Wilton and Jouppi [16]. The idea is to break FUBs 
into smaller components like, decoder buffers, word-lines, bit-lines, sense amplifiers, 
etc. The analytical models compute the effective switching capacitance using some 
basic design. The power constants can then be calculated by 

$\text{Power constant} = C_{eff} V_{sw}^{2} f$, 

where $C_{eff}$ is the effective switching capacitance, $V_{sw}$ is the voltage swing and $f$ is the 
frequency. We have made several refinements to the models that are described below. 

4.2.1 Scaling 

We have included scaling factors in all the models. These are computed from the 
technology file provided. The scaling feature allows us to study power dissipation 
trends in successive process technologies. Technology data has been obtained and 
extrapolated from public domain data from TSMC (Taiwan Semiconductor 
Manufacturing Company) and several other publications [4] . 

4.2.2 Leakage Power 

The leakage power is estimated by calculating the number of leaking devices, and 
scaling them with the leakage current from the technology configuration file. We have 
built in support for multiple V_t technology, which is becoming more important as 
feature sizes continue to shrink. 
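
A minimal sketch of a leakage estimate along these lines is given below for a dual V_t technology. The structure, the function name, and the step of multiplying the total leakage current by the supply voltage to obtain power are assumptions; the text states only that the number of leaking devices is scaled by the leakage current from the technology configuration file.

```c
/* Hypothetical technology data for a dual-Vt process. */
typedef struct {
    double vdd;            /* supply voltage (V) */
    double ileak_low_vt;   /* leakage current per low-Vt device (A) */
    double ileak_high_vt;  /* leakage current per high-Vt device (A) */
} tech_leak_t;

/* Leakage power (W): total leakage current times the supply voltage (assumed). */
double leakage_power(const tech_leak_t *t,
                     unsigned long n_low_vt, unsigned long n_high_vt)
{
    double i_total = (double)n_low_vt * t->ileak_low_vt
                   + (double)n_high_vt * t->ileak_high_vt;
    return t->vdd * i_total;
}
```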

4.2.3 Miller Capacitance Correction 

Miller capacitance is the capacitance between the gate and the source/drain regions of 
the transistor, also known as the overlap capacitance. The value of Miller capacitance 
can directly affect the power consumption of an integrated circuit. As feature size 
reduces, the gate capacitance scales down but the Miller capacitance does not scale 
that well, and hence becomes a significant fraction of the switching capacitance. This 
enhancement will help to get a better estimate for very high frequency 
microprocessors based on 0.13 µm technology and beyond. One way of estimating this 
factor is by sweeping the channel length of an inverter while monitoring the total 
capacitance and extrapolating to zero channel length. The residual capacitance is the 
Miller capacitance, which can be used as a correction factor. 
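
The extrapolation step can be illustrated with an ordinary least-squares line fit, where the intercept at zero channel length is taken as the residual capacitance. The function name and the choice of a straight-line fit are assumptions; the text does not say how the extrapolation is carried out.

```c
#include <stddef.h>

/* Fit cap = slope * len + intercept by least squares and store the intercept
 * (capacitance extrapolated to zero channel length) in *c_miller.
 * Returns 0 on success, -1 if the fit is not possible. */
int miller_cap_estimate(const double *len, const double *cap, size_t n,
                        double *c_miller)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;

    if (n < 2)
        return -1;  /* need at least two sweep points */

    for (size_t i = 0; i < n; i++) {
        sx  += len[i];
        sy  += cap[i];
        sxx += len[i] * len[i];
        sxy += len[i] * cap[i];
    }

    double denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;  /* all channel lengths identical */

    double slope = ((double)n * sxy - sx * sy) / denom;
    *c_miller = (sy - slope * sx) / (double)n;
    return 0;
}
```

The input arrays would hold the swept channel lengths and the total capacitance observed at each length, in any consistent units.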

4.2.4 Circuit Style Choice 

We provide a choice of static and dynamic circuits for the decoders because the two 
technologies can have power dissipations quite different from each other. 

4.2.5 Design 

The design of the bit-line power module includes some features used in the latest 
designs. These include certain pre-charging techniques, bit-line isolation and single- 
ended sensing. Single-ended sensing is especially important as it is being used in 
register files in many of the processors. 

The basic analytical models provided as part of the simulator are: decoder buffer, 
decoder, word-line, bit-line, sense amplifier, output MUX and a generic MUX. We 
have constructed models for many of the FUBs using these basic models. Presently, 
we do not have models for non-regular FUBs like ALU, FPU and so on. We use 
empirical data for these. The analytical models for the following FUBs are included: 
caches, TLBs, branch target buffer, return address stack, load/store queue, 
register file [17], register allocation table, and register update unit [18], [19]. New 
models can be added to the simulator as and when the need arises. 

4.3 Temperature Model 

The temperature model can be used to study how thermal management impacts the 
operating frequency and cost/performance tradeoff of a design. Product reliability and 
sub-threshold leakage power are exponentially related to temperature. 

The active microprocessor functions like a space heater and the power dissipated 
translates into a temperature increase with a certain time lag. This can be modeled 
using conventional heat transfer theory, as illustrated in Fig. 4. The transfer function 
for this model in the s-domain is shown below [21]. 

$$\frac{T(s)}{P_d(s)} = \frac{R_t}{R_t C_t s + 1}$$ 

[Figure: microprocessor block with thermal parameters $(R_t, C_t)$ and temperature $T$.] 


Fig. 4. Temperature Model. 



$P_d$ is the power dissipated by the microprocessor, $P_{loss}$ is the power loss to the 
environment, $R_t$ is the thermal resistance, $C_t$ is the thermal capacitance, and $T$ is the 
microprocessor temperature. Physically, the transfer function means that power (W) is 
directly converted into temperature (°C) through the $R_t$ constant. This conversion 
experiences a time-lag governed by the first-order rate law with a time constant 
$\tau$, or $R_t C_t$. 

The thermal model for the simulator consists of two phases. The first phase performs 
a power to temperature computation for the current cycle, and the second phase molds 
the result to abide by the exponential first-order rate law. Each microprocessor resides 
in a package that, together with its intrinsic properties, exhibits certain thermal 
behavior. The ability of the package to remove heat to the ambient in steady-state 
condition is captured by the net thermal resistance. The power to temperature 
computation is done as shown below: 

Silicon Temp. = Ambient Temp. + (Thermal Resistance) * (Power Dissipated). 

This equation is widely used to determine the maximum junction or silicon 
temperature ($T_{max}$) that the package supports. However, during normal operation, the 
dissipated power is not instantaneously converted into temperature due to the slow 
material response. Therefore, an exponential model is developed to reflect the natural 
response of the thermal conduction. Heating and cooling responses are modeled by 
the following equations: 

Heating: $\Delta T^{+} = [T_{max} - T_{i-1}] \times [1 - \exp(-1/\tau_h)]$ 

Cooling: $\Delta T^{-} = [T_{i-1}] \times [1 - \exp(-1/\tau_c)]$ 

where $\Delta T^{+}$ is the temperature rise, $T_{max}$ is the maximum junction temperature that the 
package supports, $T_{i-1}$ is the temperature of the previous cycle, $\tau_h$ is the heating 
thermal time constant, $\Delta T^{-}$ is the temperature decline, and $\tau_c$ is the cooling thermal 
time constant. 

The decision as to whether the microprocessor should be heating up or cooling down 
is made by the following criterion: 

If (instantaneous temperature generated by ($T_a + R_t P_i$) is greater than $T_{i-1}$) then 
$T_i = T_{i-1} + \Delta T^{+}$ 

Else 

$T_i = T_{i-1} - \Delta T^{-}$ 

where $T_a$ is the ambient temperature, $T_i$ is the present cycle temperature, $R_t$ is the 
thermal resistance, and $P_i$ is the present cycle power dissipation. 

The ambient temperature, thermal resistance, heating/cooling thermal time constants, 
and maximum supported temperature are real-life design constraints that are made 
inputs to the simulator through the configuration files. With this, a multi-faceted design 
optimization can be performed. 
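
A minimal sketch of the two-phase per-cycle update is shown below. The structure and function names are hypothetical. The two exponential factors are assumed to be precomputed once from the heating and cooling time constants (expressed in cycles) as 1 - exp(-1/tau), so that the per-cycle routine needs no call to exp(). The cooling term uses the previous temperature alone, following the cooling equation as far as it can be read from the text.

```c
/* Hypothetical thermal configuration, read once from the configuration file. */
typedef struct {
    double t_ambient;    /* ambient temperature (deg C) */
    double t_max;        /* maximum junction temperature supported (deg C) */
    double r_thermal;    /* thermal resistance (deg C per W) */
    double heat_factor;  /* precomputed 1 - exp(-1/tau_heating) */
    double cool_factor;  /* precomputed 1 - exp(-1/tau_cooling) */
} thermal_cfg_t;

/* One cycle of the temperature model.
 * t_prev is the previous-cycle temperature, power the present-cycle power.
 * Returns the present-cycle temperature. */
double thermal_step(const thermal_cfg_t *c, double t_prev, double power)
{
    /* Phase 1: instantaneous power-to-temperature conversion. */
    double t_target = c->t_ambient + c->r_thermal * power;

    /* Phase 2: apply the first-order rate law in the chosen direction. */
    if (t_target > t_prev)
        return t_prev + (c->t_max - t_prev) * c->heat_factor;   /* heating */
    return t_prev - t_prev * c->cool_factor;                    /* cooling */
}
```

Calling thermal_step once per simulated cycle, with the result fed back as t_prev, yields the temperature profile of a FUB or of the whole chip.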



5 Preliminary Results 

Validation of the power estimator is a big task given the rich functionality provided. 
In this section, we present some of the initial validation efforts. The validation of the 
power estimator has to be done at several levels, which are listed below: 

• Make sure that the power models generate power numbers close enough to 
real design. 

• Verify that the FUB level and full chip level power numbers make sense. 

• Verify the thermal models. 

At the lowest level, we need to check if our analytical models produce results 
comparable to real design. For this, we used the data cache as a test case. We 
designed a 128-set, one-way set-associative data cache with a block size of 128 bits. The 
number of output bits was also fixed to 128 bits. The design included a two-stage 
decoder, standard RAM cell with dual rail sensing, pre-charge circuits and sense 
amplifiers. The device dimensions of the RAM circuit were optimized to operate at a 
frequency of 500 MHz in 0.25 micron technology. The technology parameters were 
based on the TSMC 0.25 micron technology. The cache was simulated using SPICE 
and the following results were obtained for power dissipation. 

Power for read under 100% activity = 0.274 W. 

Power for write under 100% activity = 0.184 W. 

We used the same configuration for our data-cache analytical model and it gave us the 
following numbers: 

Power for read under 100% activity = 0.228 W. 

Power for write under 100% activity = 0.126 W. 

This is roughly a 17% error in read and a 32% error in write operation, relative to the SPICE results. We further investigated 
the problem and found several reasons for the lower power estimate we got, which are 
listed below: 

• The power model built into the simulator is based on a generic design. 

• The design that we built was a custom design, and hence, the device sizes were 
not exactly the same as we modeled. 

• The power models used some power saving features used in low power 
caches like sense-amplifier isolation, etc., which were not incorporated in the 
real design. This is bound to give us a lower power estimate. 

However, we think that these results are encouraging for a preliminary test. 

Next, we ran several benchmarks on the estimator, and we got full chip numbers in 
the same order of magnitude as we expected them to be. We have shown the results of 
compress95 below. The most important outcome of this simulation is that the 
infrastructure is intact, and the subcomponents of the simulators are interfacing 
correctly. The true accuracy of the simulator depends on the individual power models 
(validated by previously published sources), and will be evaluated in future research 
papers. 

Tue Aug 22 08:46:53 2000 

Power simulation checkpoint at 1998849 instructions 
Global statistics: 

Total power = 10.923357 

Maximum power = 81.091713 

Maximum didt power = 57.502300 



Additional validation results have been published by Ghiasi and Grunwald at PACS 
2000 [22]. The results conform to our claim that our simulator captures detailed 
activity. That work did mention that the simulator is not flexible enough, but with the added 
analytical models, this is no longer a concern. 

Finally, the temperature model was validated with a set of measurement data from a 
previously published source. The measurement setup is shown in Figure 5 [40]. In this 
setup, the heater plate represents the operating microprocessor, and a rubber block is 
used to apply pressure on the thermocouple against the heat sink (Al base material 
with epoxy paint). The temperature is recorded as soon as pressure is applied through 
the rubber block. The measurement data captures the thermal transient behavior, and 
it shows a very close fit (within 7%) to the model’s prediction, as shown in Figure 6. 
The most important outcome of this validation is that thermal transients do exhibit 
exponential first-order rate behavior (as suggested by the temperature model), and the 
rest of the work is a matter of coding this behavior into the simulator. With that in 
mind, one can use this model with other combinations of thermal time constants, 
thermal resistances, etc. to investigate the effects of packaging on the overall 
reliability, power, and performance optimization. 



[Figure: measurement setup showing the heat sink, thermocouple, and heat source.] 

Fig. 5. Temperature Measurement Setup. 




Fig. 6. Thermal Transient Behavior (Thermal Time Constant = 4 secs). 

The temperature model can then be applied to investigate a clock throttling mechanism, 
as illustrated by Figure 7 below. In this particular example, the throttling mechanism 
is set to trigger at 65°C, and when this occurs, the microprocessor shuts off its clock 
to cease operation. If the microprocessor has not reached its thermal runaway 
threshold, its temperature should be cooling down. The microprocessor is 
programmed to resume normal operation when the temperature falls below 55°C (this 
gives a 10°C hysteresis). In this case, the throttling mechanism reduces the maximum 
temperature of the RUU by Td, but sacrifices overall compute time by tcpu. 
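
The trigger and resume behavior described above amounts to a simple hysteresis controller, sketched below. The structure and function names are hypothetical; the 65°C and 55°C values in the usage note are the ones quoted in the example.

```c
/* Hypothetical clock-throttling state with hysteresis. */
typedef struct {
    double trigger_c;  /* shut off the clock at or above this temperature */
    double resume_c;   /* resume operation below this temperature */
    int    throttled;  /* 1 while the clock is shut off */
} throttle_t;

/* Decide whether the clock runs this cycle, given the current temperature.
 * Returns 1 if the clock is enabled, 0 if it is shut off. */
int throttle_clock_enabled(throttle_t *th, double temp_c)
{
    if (!th->throttled && temp_c >= th->trigger_c)
        th->throttled = 1;   /* trigger point reached: stop the clock */
    else if (th->throttled && temp_c < th->resume_c)
        th->throttled = 0;   /* cooled below the resume point: restart */
    return !th->throttled;
}
```

Initializing the state with a trigger of 65.0 and a resume point of 55.0 reproduces the 10°C hysteresis of the example; while the clock is off, the activity fed to the power model would be zero, so the temperature model follows its cooling branch.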

There are several avenues available for the future. Our initial efforts would be to 
complete the validation efforts. Next, we would like to use the simulator under 
various modes to study certain power reduction techniques. And more importantly, 
we will continue to enhance the functionality of the simulator, refining models, and 
trying to get results closer to real numbers as we go on. 

[Plot axis: Temperature (°C)] 




Fig. 7. Register Update Unit Temperature Profile. 



6 Conclusions 

The primary goal of this paper was to present TEM^P^EST, a flexible 
power/performance analysis infrastructure. Presently, the simulator supports dual 
mode power simulation using either empirical data or analytical power models. The 
power models generate not only dynamic power estimates but also leakage power 
estimates, which is becoming significant with the shrinking process technology. 
Additional features included are technology scaling options and dual V_t technology 
support. The simulator also provides a thermal model to convert the power numbers 
into a temperature profile. This is especially important for the study of thermal gradients 
and strategies like clock throttling. Finally, the simulator provides a user-friendly 
configuration specification interface. Preliminary results have proven to be 
encouraging and we see a lot of scope for future research using this tool. We 
have also succeeded in upgrading the simulator to SimpleScalar 3.0b without much effort, 
which again demonstrates the modularity. 

Abstract. We describe a new power-performance modeling toolkit, developed 
to aid in the evaluation and definition of future power-efficient 
PowerPC™ processors. The base performance models in use in this 
project are: (a) a fast but cycle-accurate, parameterized research sim- 
ulator and (b) a slower, pre-RTL reference model that models a spe- 
cific high-end machine in full, latch-accurate detail. Energy characteri- 
zations are derived from real, circuit-level power simulation data. These 
are then combined to form higher-level energy models that are driven by 
microarchitecture-level parameters of interest. The overall methodology 
allows us to conduct power-performance tradeoff studies in defining the 
follow-on design points within a given product family. We present a few 
experimental results to illustrate the kinds of tradeoffs one can study 
using this tool. 



1 Introduction 

Power dissipation limits constitute a key new constraint in the design of high-end 
microprocessors. These limits are becoming important for two main reasons: 

Chip-level cost/performance evolution: 

At the chip-level, there has been an ever-increasing growth in complexity, 
clock frequency and die size. This is driven by advances in semiconductor 
technology and the quest to keep up with Moore’s Law from a performance 
viewpoint. Power consumed by the processor must be dissipated by the use of 
appropriate packaging and cooling solutions. These latter solutions get more 
sophisticated and costly as power increases. As single-threaded uniprocessor 
IPC performance slowed due to fundamental ILP limits, the added com- 
plexities needed to meet net performance growth targets caused power (and 
cost) budget overruns. The impact on verification cost and time-to-market 
also started to affect the actual cost-performance trends. Multithreading and 
multiprocessing at the chip-level have evolved in an attempt to correct the 
cost-performance imbalance. That is, through architectural paradigm shifts, 
the hope is to boost the net performance significantly, but with relatively 
small power, complexity and die-size increments. However, packaging 
technology limits within the “air-cooling” regime have forced designers to look 
for power-saving opportunities at all levels of high-end design, irrespective 
of the microarchitectural paradigm adopted. 

System-level, volumetric cost/performance evolution: 

At the very high-end, one can argue that the system cost (with all the 
“box”, memory and peripheral chip/board and switch costs) is so much 
higher than a single processor chip cost that worrying about power/cost 
issues at the single chip level is irrelevant. However, this is not a correct 
perspective. Within the form factor limits of a server “box” , if one is forced 
to increase the volume occupied by cooling hardware (e.g. through the use 
of refrigeration or liquid cooling), then that takes away from the number of 
processors which could have been included. Also, there are some basic upper 
limits on the amount of current drawn by a server box to meet the overall 
processing and cooling needs. Thus, the volumetric cost/performance growth 
constraints translate into a “per-chip” power limit for a particular system 
product targeted for a given market. 

Thus, early-stage microarchitecture definition and trade-off analysis studies 
must now (more than ever before) try to factor in considerations of power and 
design complexity. Recently, there have been papers from academia and industry 
[1,3,10,11,12,13] that address the issue of modeling and design of power- and 
complexity-aware microarchitectures. In this paper, we report ongoing work in 
this area within IBM Research. The power-performance modeling methodology 
described in the prior work of Brooks et al. [1] is adapted for use within the 
modeling framework of a real, server-class processor development project. The 
key new contributions in this power-performance modeling tool are: 

— Energy models that are derived from real, circuit-level power simulation 
data, but are then driven by microarchitecture-level parameters of interest. 
These higher-level abstractions are suitable for conducting power-performance 
tradeoff studies to define the follow-on design points within 
a given product family. Technology parameters and scaling equations are 
additional inputs to the model. 

— A web-based graphical user interface, which allows one to quickly charac- 
terize the fundamental tradeoffs between performance growth and power- 
related cost, based on prior, one-time simulation data collected in a spread- 
sheet database. 

Using this new modeling toolkit, we evaluate a current generation, high- 
end PowerPC processor design point from the viewpoint of power-performance 
efficiency. As part of this evaluation, we examine the sensitivity of such efficiency 
metrics with respect to individual (and combinations of) microarchitecture-level 
parameters: cache size and geometry parameters, queue/buffer sizes, number of 
ports to various storage resources, various other bandwidth parameters, etc. 

Fig. 1. Block Diagram of PowerTimer. 



2 PowerTimer: An Energy-Aware Performance 
Simulation Toolkit 

Figure 1 shows the high-level block diagram of PowerTimer, our energy-model- 
enabled performance simulator. The basic methodology is similar to earlier mod- 
els, like Wattch [1]. 

The energy models are derived from circuit-level power simulation data, col- 
lected on a detailed, macro-by-macro basis. These models are controlled by two 
sets of parameters: (a) technology/circuit parameters, which allow appropriate 
scaling from one CMOS generation to the next; and (b) microarchitecture-level 
parameters: various queue/buffer sizes, pipe latencies and bandwidth values. 
These latter parameters also drive the base performance simulator in the usual 
manner. The energy models can be used in two different modes. First, the perfor- 
mance simulator can be used standalone, to produce detailed CPI and resource 
utilization statistics. These can then be processed through the energy models to 
generate average, unit-wise power numbers. Second, the energy models can be 
embedded in the actual simulation code, so that they are “looked up” as needed 
on a cycle-by-cycle basis. This mode allows one to view the cycle-by-cycle energy 
characteristics in more detail; but the average statistics at the end of the run 
would obviously be the same as in the first mode. 

2.1 Energy Model Construction 

In the Wattch simulator [1], and in other similar toolkits [12,13], analytical 
capacitance models were developed for various high-level block-types, such as 
RAMs, CAMs and other array structures, latches, buses, caches, and ALUs. 
While some of the characterizing parameters are gross length and width values 
which a logic-level designer or microarchitect could relate to, others were at a 
much lower (circuit or physical design) level. In our current (PowerPC-specific) 
work, the goal is to form unit-specific energy models controlled by parameters 
familiar to a high-level designer or microarchitect. Thus, for example, once a 
characterizing equation has been formed for one of the issue queues, one is able 
to play “what-if” games in PowerTimer, by simply varying the queue size as 
normally done in microarchitectural performance simulation. 

Fig. 2. PowerTimer Energy Models.



Figure 2 below depicts the derivation of the energy models in more detail. The 
energy models are based on circuit-level power analysis that has been performed 
on structures in a current, high performance PowerPC processor. The power 
analysis has been performed at the macro level; generally, multiple macros com- 
bine to form one micro-architectural level structure which we will call a sub-unit. 
For example, the fixed-point issue queue (one sub-unit) might contain separate 
macros for storage memory, comparison logic, and control. Power analysis has 
been performed on each macro to determine the macro’s power as a function 
of the input switching factor. The hold power, or power when no switching is 
occurring, is also generated. These two pieces of data allow us to form simple 
linear equations for each macro’s power. The energy model for a sub-unit is de- 
termined by summing the linear equations for each macro within that sub-unit. 
We have generated these power models for all microarchitecture-level structures 
(sub-units) modeled in our research simulator [8,9]. In addition to the mod- 
els that specify the power characteristics for particular sub-units (such as the 
fixed-point issue queue), we can derive power models for more generalized struc- 
tures; for example, a generalized issue queue model. These generalized models 
are useful for estimating the power cost of additions to the baseline microarchi- 
tecture. The generalized model is derived by analyzing the power characteristics 
of structures within the baseline microarchitecture. For example, the fixed-point, 
floating-point, logical-op, and branch-op queues have very similar functionality 
and power characteristics and the energy analysis for these queue structures has 
been used to derive a generalized issue-queue power model based on parameters 
such as the number of entries, storage bits, and comparison operations. 
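
The macro-to-sub-unit construction described above can be sketched as follows. This is an illustration only: the class and function names, the coefficient values, and the simplification that all macros in a sub-unit see the same switching factor are assumptions, not part of PowerTimer.

```python
from dataclasses import dataclass

@dataclass
class Macro:
    name: str
    hold_power: float   # power with no switching (illustrative units)
    slope: float        # added power per unit of input switching factor

    def power(self, switching_factor: float) -> float:
        # Simple linear equation formed from the two circuit-level data points.
        return self.hold_power + self.slope * switching_factor

def subunit_power(macros, switching_factor):
    """Sub-unit power is the sum of its macros' linear equations."""
    return sum(m.power(switching_factor) for m in macros)

# Hypothetical fixed-point issue queue built from three macros.
fx_issue_queue = [
    Macro("storage", hold_power=0.20, slope=0.60),
    Macro("compare", hold_power=0.10, slope=0.90),
    Macro("control", hold_power=0.05, slope=0.25),
]

idle = subunit_power(fx_issue_queue, 0.0)   # hold power only
busy = subunit_power(fx_issue_queue, 0.5)   # half of the inputs switching
```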

Since we are interested in determining power-performance tradeoff analysis 
for future microarchitectures within a particular product family, we must de- 
termine a method of scaling the power of microarchitectural structures as the 
size of these sub-units increases. The scaling factor depends on the particular 
structure; for example, the power of a cache array will scale differently than 
that of an issue queue. In addition, as resources increase in size, they necessarily 
cause other structures to become larger. For example, as the number of rename 
registers increases, the number of tag bits within each entry of the issue queues 
increases. Generally, as we increase the number of entries in a structure, there 
will be a proportional increase in the power. For this reason, we use linear
scaling as a basis for many of the structures that we consider. In addition, we have
performed detailed analysis on the scaling of queue and mapper structures. For 
these structures, we have determined the average power per storage bit and per 
comparison operation. As the queues and mappers increase in size, we compute 
the number of storage bits and comparisons that occur for the larger structures. 
We also use previously published work on power scaling within cache arrays 
which we discuss in Section 3.3. 
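
As a rough illustration of the per-bit and per-comparison scaling, the sketch below computes queue power from storage bits and comparisons. The function names, the per-bit and per-comparison costs, the tag-width rule (enough bits to name any rename register), and the one-comparison-per-source-tag rule are assumptions made for the example.

```python
import math

def tag_bits(rename_registers):
    # Assumption: a tag needs enough bits to name any rename register.
    return math.ceil(math.log2(rename_registers))

def queue_power(entries, payload_bits, source_tags, rename_registers,
                power_per_bit, power_per_compare):
    bits_per_entry = payload_bits + source_tags * tag_bits(rename_registers)
    storage_bits = entries * bits_per_entry
    comparisons = entries * source_tags  # assumption: one tag match per source tag
    return storage_bits * power_per_bit + comparisons * power_per_compare

# Illustrative costs in arbitrary units.
base = queue_power(20, 32, 2, 64, 1.0, 4.0)
more_renames = queue_power(20, 32, 2, 128, 1.0, 4.0)   # wider tags in every entry
double_entries = queue_power(40, 32, 2, 64, 1.0, 4.0)  # twice the entries
```

Doubling the entries doubles the power in this model, while enlarging the rename pool raises queue power indirectly through the wider tags.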

2.2 Web-Based Interface and Power-Performance Metrics 

In order to thoroughly explore the modeled design space, we selected 19 work- 
loads (8 SPECint95, 10 SPECfp95, and TPC-C) each of which was evaluated 
for over 75 hardware configurations. Analyzing this amount of data is difficult 
and a GUI makes the results of our analysis useful to our colleagues within IBM 
Research, as well as designers within the IBM Product groups. We developed 
a web-based back-end analysis tool which allows the user to select the bench- 
marks of interest and the microarchitectural parameter(s) to vary as well as the 
technology parameters such as frequency, voltage, and feature size. 

The tool also allows the selection of various power-savings features such as the 
style of conditional clocking within the microarchitecture. Finally, the tool
provides the choice of five power-performance metrics: average CPI, average power
dissipation, CPI * power, (CPI)^2 * power, and (CPI)^3 * power. The latter three
metrics correspond to energy, energy-delay product [5,2], and energy * delay^2.
Since power is proportional to the square of supply voltage (Vdd) multiplied
by clock frequency, and clock frequency is proportional to Vdd, power is
proportional to the cube of Vdd. Thus delay cubed multiplied by power provides
a voltage-invariant power-performance characterization metric which we feel is
most appropriate for server-class microprocessors. In the remainder of the paper
we will present our power-performance results as (CPI)^3 * power.
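
The voltage-invariance argument can be written out as a short derivation. It assumes that CPI itself does not change with clock frequency and that delay D is the execution time of a fixed instruction count; the constant k (standing for switched capacitance and activity) is introduced here only for illustration.

```latex
\begin{align*}
P &= k\, V_{dd}^{2}\, f, \qquad f \propto V_{dd}
  \;\Longrightarrow\; P \propto k\, V_{dd}^{3}, \\
D &\propto \frac{\mathrm{CPI}}{f} \propto \frac{\mathrm{CPI}}{V_{dd}}, \\
D^{3} P &\propto \frac{\mathrm{CPI}^{3}}{V_{dd}^{3}} \cdot k\, V_{dd}^{3}
  = k\, \mathrm{CPI}^{3}.
\end{align*}
```

Under these assumptions the Vdd terms cancel, so the metric depends only on the microarchitecture (through CPI and k) and not on the chosen operating voltage.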

3 Power-Performance Evaluation Examples 

In this section, we first provide a high-level description of the processor model 
assumed in our simulation toolkit. Then, we present some example experimental 
results with analysis and discussion. The results were obtained using our current 
version of PowerTimer, which works with pre-silicon performance models used 
in defining future PowerPC structures. 

3.1 Base Microarchitecture Model 

For the purposes of this paper, we assume a generic, parameterized, out-of-order 
superscalar processor model adopted in a research simulator called Turandot [8, 
9]. The overall pipeline structure (as reported in [8]), is repeated here in Figure 
3. The modeled microarchitecture is similar in complexity to a current gener- 
ation microprocessor (e.g. [4,7] ). As described in [8], this research simulator 

Fig. 3. Processor Organization Modeled by the Turandot Simulator.



was calibrated against a pre-RTL, detailed, latch-accurate processor model (re-
ferred to as R-model in [8]). The R-model is a custom simulator, written in C++
(with mixed VHDL “interconnect code”). There is a 1-to-1 correspondence of
signal names between the R-model and the actual VHDL (RTL) model. How- 
ever, the R-model is about two orders of magnitude faster than the RTL model 
and is considerably more flexible. Many microarchitecture parameters can be 
varied, albeit within restricted ranges. Turandot, on the other hand is a classical 
trace/execution-driven simulator, written in C, which is 1-2 orders of magnitude 
faster than R-model. It supports a much greater number and range of parameter 
values. 

In this paper, we report power-performance results using the same version of 
R-model that was used in [8]. That is, we first used our energy models in con- 
junction with the R-model: this ensured accurate measurement of the resource 
utilization statistics within the machine. To circumvent the simulator speed lim- 
itations, we used a parallel workstation cluster; also, we post-processed the per- 
formance simulation output and fed the average resource utilization statistics 
to the energy models to get the average power numbers. This is faster than the 
alternative of looking up the energy models on every cycle. While it would have 
been possible to get instantaneous, cycle-by-cycle energy consumption profiles 
through such a method, it would not have changed the average power numbers for 
entire program runs. Having used the detailed, latch-accurate reference model for 
our initial energy characterization, we were able to look at the unit- and queue- 
level power numbers in detail in order to understand, test and refine the various 
energy models. Currently, we have reverted to using an energy-model-enabled 
Turandot model, for fast CPI vs. Power tradeoff studies with full benchmark
traces. Turandot allows us to experiment with a wider range and combination 
of machine parameters. In future publications and talks based on PowerTimer, 
we plan to report these results in detail. 

3.2 Workloads Used in the Study 

In this paper, we report experimental results based on the SPEC95 benchmark 
suite and a commercial TPC-C trace. All workload traces are PowerPC-based. 
The SPEC95 traces were generated using the tracing facility called Aria within 
the MET toolkit [9]. The particular SPEC trace repository used in this study 
was created by using the full reference input set. However, sampling was used to 
reduce the total trace length to 100 million instructions per benchmark program. 
A systematic validation study to compare the sampled traces against the full 
traces was done, in finalizing the choice of exact sampling parameters. The TPC- 
C trace used is a contiguous (i.e. unsampled) trace collected and validated by the 
processor performance team at IBM Austin. It is about 180 million instructions 
long. 

In the following three sections we present examples of the use of the Power- 
Timer simulation infrastructure. The results show the average CPI and average 
(CPI)^3 * power for the traces described above. Each SPEC data point was ob-
tained by averaging across the benchmark suite. Note, however, that we have
excluded apsi from the SPECfp results due to a problem with these simulation 
runs. 

3.3 Data Cache Size and the Effect of Scaling Techniques 

In this section we evaluate the relationship between performance, power, and L1
data cache size. We vary the cache size by increasing the number of cache lines
per set while leaving the linesize and cache associativity constant. Figure 4a and
4b show the results of increasing the cache size from the baseline architecture
(points labeled 1x on the x-axes). Figure 4a illustrates the relation between the
cache size in the first-level data cache and the relative CPI for the workloads that
we studied. The CPI value for each cache size is computed as a ratio, relative
to the base 1x CPI for that workload. Figure 4b shows the relation when we
consider the metric (CPI)^3 * power. From Figure 4a, it is clear that the small
CPI benefits of increasing the data cache are outweighed by the increases in 
power dissipation due to larger caches. 

In Figure 4b, we show the results with two different scaling techniques. The 
first technique assumes that power scales linearly with the cache size. As the 
number of lines is doubled, the power of the cache is also doubled. The second 
scaling technique is based on data from [6] which studied energy optimizations 
within multi-level cache architectures. In [6], data is presented for cache power 
dissipation for conventional caches with sizes ranging from 1KB to 64KB. 

In the second scaling technique, which we call “non-lin” in Figure 4b, the 
cache power is scaled with the ratios presented in [6]. The increase in cache 

Fig. 4. Variation of Performance and Power-Performance with Cache Size.



power by doubling cache size using this technique is roughly 1.46x, as opposed 
to the 2x with the simple linear scaling method. Obviously the choice of scaling 
technique can greatly impact the results. It is clear, however, that with either 
scaling choice, conventional performance-focused cache organizations will not 
scale in a power-efficient manner. (Note that the curves shown in Figure 4b 
assume a fixed circuit/technology generation; they are intended to show the 
effect of adding more cache to the current design.) 
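
To make the difference between the two scaling techniques concrete, the sketch below compounds a per-doubling power multiplier over several doublings. Applying the rough 1.46x figure uniformly to every doubling is an assumption made for this example; the ratios in [6] may differ from one cache size to the next, and the function name is hypothetical.

```python
import math

def cache_power_factor(size_factor, per_doubling):
    """Relative cache power after growing the cache by size_factor,
    assuming a constant multiplier for every doubling of capacity."""
    doublings = math.log2(size_factor)
    return per_doubling ** doublings

linear_4x = cache_power_factor(4, 2.0)    # linear scaling: 2x per doubling
nonlin_4x = cache_power_factor(4, 1.46)   # "non-lin": roughly 1.46x per doubling
```

For a 4x cache the two techniques give about 4.0x and 2.13x the baseline cache power respectively, which is why the choice of technique matters for the curves in Figure 4b.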

3.4 Number of Completion Buffers 

In the target microarchitecture, the number of completion buffers determines 
the total number of instructions that can be active within the machine. The 
completion table is very similar to a re-order buffer in that it tracks instructions 
as they dispatch, issue, execute, wait for exceptions, and complete. Figures 5a 
and 5b show the effects of varying the number of completion buffers on perfor- 
mance and the power-performance metric. From Figure 5a, it is evident that 
little additional performance is gained by increasing the number of buffers past 
the current design point (1x). When considering (CPI)^3 * power in Figure 5b,
we see that power-efficiency is slightly degraded by increasing the number of 
entries due to a roughly 3% increase in the core’s power dissipation. 

3.5 Ganged Sizing 

Out-of-order superscalar processors of the class considered rely on queues and
buffers to efficiently decouple instruction execution to increase performance. The 
depth of the pipeline and the sizes of the resources required to support decoupled 
execution (queues, rename registers, completion table) combine to determine the 

Fig. 5. Variation of Performance and Power-Performance with Number of Completion Buffers.

Fig. 6. Variation of Performance and Power-Performance with Core Size (ganged parms).



performance of the machine. Because of this decoupled execution style, increasing 
the size of one resource without regard to the other resources in the machine 
may quickly create a performance bottleneck. Thus, in this section we consider 
the effects of varying multiple parameters rather than just a single parameter. 

Figure 6a and 6b show the effects of varying all of the resource sizes within 
the processor core. This includes issue queues, rename registers, branch predictor 
tables, memory disambiguation hardware, and the completion table. For the 
buffers and queues, the number of entries in each resource is scaled by the values
specified in the charts (0.6x, 0.8x, 1.2x, and 1.4x). For the instruction cache,
data cache, and branch prediction tables, the sizes of the structures are doubled
or halved at each data point. From Figure 6a, we can see that performance is
increased by 5.5% for SPECfp, 9.6% for SPECint, and 11.2% for TPC-C as the
size of the resources within the core is increased by 40% (except for the caches
which are 4x larger). This configuration had a power dissipation 52%-55%
higher than the baseline core. Figure 6b shows that the most power-efficient
core microarchitecture is somewhere between the 1x and 1.2x cores.

4 Conclusion 

We have described PowerTimer: a research power-performance simulator de- 
signed to help with the definition and evaluation of follow-on products within 
the high-end PowerPC microprocessor family. Based on this model, we have eval- 
uated power and performance tradeoffs using SPEC95 workloads and a TPC-C 
trace. We have presented a few selected experimental results from our analysis 
repository to illustrate the kinds of tradeoffs that one may be able to study 
using this toolkit. A web-based interface allows users to view specific power- 
performance tradeoff curves of their choice. This allows users to evaluate the 
worth and wisdom of making specific microarchitecture-level enhancements to 
an existing design point. The tool allows one to evaluate whether a certain as- 
pect of the design is inherently power-efficient or not. For example, in an initial, 
voltage-invariant “technology remap” scenario, we may like to know whether 
simply increasing the cache sizes, without perturbing the core engine would buy 
us enough performance to counterbalance any power increase. 

PowerTimer allows one to experiment with a large number of design pa- 
rameters and there are multiple choices available in terms of selecting a power- 
performance efficiency metric. We have presented just a few examples in this 
paper. For example, one can study the effectiveness of various flavors of con- 
ditional clocking to see how the sensitivity curves are affected. Also, the use of 
technology scaling parameters, allows the user to explore the future design space 
in a realistic manner. 

Abstract. Reducing power, on both a per cycle basis and as the
total energy used over the lifetime of an application, has become more 
important as small and embedded devices become increasingly available. 
A variety of techniques are available to reduce power, but it is difficult to 
quantify the benefits of these techniques early in the system design phase 
when processor architecture is being defined. Accurate tools that allow 
for exploration of the design space during this phase are crucial. This 
paper describes our experience with two such tools, the Cai-Lim power 
model and Wattch, which have been made available to the computer 
architecture community over the past year. We focus on how the models 
are constructed, the granularity of activity revealed by the models, the 
ability to understand why particular power results are obtained and the 
accuracy of the models. We raise concerns about detailed simulations 
where the power model, the simulator model and the desired architecture 
to be simulated differ and the validity of data obtained in such situations. 

Keywords: Power Analysis Tools, Performance Comparison, Validation, 
Architectural-definition stage 



1 Introduction 

It is increasingly important to consider power use at every phase of processor 
design. Traditionally, power estimates and analysis were performed after a design 
was laid out, typically using SPICE analysis. As power demands have increased,
CAD tools include power analysis earlier in the design cycle. For example, the
Synopsys toolsuite contains tools to estimate power demands using information
from standard cells and activity profiles from system components. The Synopsys
PowerCompiler automatically adds clock gating, sizes transistors and suggests
some logic transforms that may reduce power.

Recently, the computer architecture community has attempted to model 
power during the architectural design phase. As in any CAD problem, different
levels of abstraction, accuracy and effort are appropriate for different stages
of the design process. During the architectural design phase, power analysis tools
should assist in floor-planning architectural designs intended to save power and
coarse-grain estimates of power. At the synthesis or layout phase, power analysis
tools should assist in transistor sizing, clock gating and the like. As a design
nears completion, power analysis tools must provide accurate information which 
may affect packaging selection or system specification. As designs progress from 
initial to final stages, the tools should produce increasingly accurate informa- 
tion; that increased accuracy requires additional time and expense. Early in the 
design phase, designers need tools that quickly reflect power savings in reliable 
terms but with possibly less accuracy. 

We think that architecture definition stage power analysis tools will be used 
in one of two styles. In the first style, an architect may have an idea for reducing 
average power that could be easily added to an existing design. The designer 
would like to quickly determine if the new mechanism would actually save power 
by modifying an existing design; if it appears promising, the full design can 
be synthesized and analyzed in more detail. Tools designed to modify existing 
designs may be reasonably accurate because they can exploit detailed functional 
and circuit models from extant designs. The full processor design may not be 
greatly impacted by the modifications, resulting in less error. 

Alternatively, the designer may be starting a new processor design, possibly 
using a new process technology and microarchitecture. Power estimation tools for 
this design style face a more difficult challenge because aspects of the processor 
design that affect a large portion of the power budget (such as clocking policies 
and clock networks) may not be finalized. Thus, the power analysis tool must 
be generalizable and extensible to ensure that it is useful.

Most architectural-level power estimation tools use activity based power es- 
timation. This paper compares two activity based power models that have been 
available to the architecture community for approximately one year. We com- 
pare these models using two experiments. In the first, we try to determine if 
a processor should use in-order or out-of-order execution semantics, a question 
typically asked earlier in the design cycle. In the second experiment, we modify 
a processor design that is similar to the design used to develop the power model. 

Our results were discouraging. Using available information about the per- 
formance bounds of the power models, we found the two extant power models 
disagree on the efficacy of the design choices in each experiment and do not 
always produce statistically significant results. 

In Section 2, we examine the power models currently available. Our exper- 
iments and experimental criteria are described in Section 3. We present our 
results in Section 4, followed by our conclusions and suggested directions for 
future work in Section 5. 



2 Available Power Models 



Developing an accurate model is a time consuming task that requires detailed 
low level knowledge of circuit design and infrastructure that many researchers 
do not generally have. Two tools, the Cai-Lim power model and Wattch, are 
currently available to researchers in the community who would like to use power
models during the architecture definition stage. 

These tools are fundamentally similar and rely on activity based power
models to estimate power demands. A transistor or gate-level power model uses 
a set of input vectors, such as a specific use of an adder or cache logic, to model 
power demands of a given design. Gate level models keep detailed information 
about the state of each register, latch or bitline in the final design. This is 
time-consuming, because the architecture is simulated in great detail, but is also 
accurate because of that same level of detail - the detailed power models may 
accurately record the effects of precharging, short circuit loss and so on. Statis- 
tical or activity based gate or RTL designs are simpler. They work by assigning 
an “average power cost” for a 0-0, 0-1, 1-0 or 1-1 transition on individual device
structures. These costs are determined by detailed models of individual circuit
elements. The RTL model is then used to execute test vectors or applications and
the occurrence of each action (e.g. 0-1 transition) is recorded and multiplied by
the pre-computed activity cost. The model is less time-consuming because only 
the register state must be maintained, not the full analog circuit characteristics. 
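
The bookkeeping of such an activity based model can be sketched in a few lines of Python. The per-transition costs and the register trace below are invented for illustration; in a real tool the costs would come from detailed models of the circuit elements.

```python
# Illustrative per-transition energy costs for one latch bit (arbitrary units).
COST = {(0, 0): 0.1, (0, 1): 1.0, (1, 0): 0.8, (1, 1): 0.1}

def count_transitions(states):
    """Count 0-0, 0-1, 1-0 and 1-1 transitions between consecutive register states."""
    counts = {key: 0 for key in COST}
    for prev, curr in zip(states, states[1:]):
        for a, b in zip(prev, curr):
            counts[(a, b)] += 1
    return counts

def estimated_energy(states):
    """Multiply each transition count by its pre-computed cost and sum."""
    counts = count_transitions(states)
    return sum(counts[key] * COST[key] for key in COST)

# Register state (one tuple of bits per cycle) recorded while running a test vector.
register_trace = [(0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 0, 0)]
counts = count_transitions(register_trace)
energy = estimated_energy(register_trace)
```

Only the register state per cycle is needed as input, which is the reason this style of model is cheaper than a full circuit-level simulation.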

The models we examine are statistical or activity functional power models. 
In these power models, the power demands of an entire functional unit may be 
measured over many different test vectors. The resulting power measurement 
constitutes an “average” power measurement for that functional unit and each 
use of the functional element in the microarchitectural simulator is assumed to 
use the average power. For example, an integer ALU can perform many functions
over many inputs. In a statistical functional energy power model, the model 
developer would use a set of representative (or randomly generated) inputs to 
determine the cost or power demand of the “average integer operation”, and 
then use that value each time the integer ALU is accessed. These models typi- 
cally assume some cost for inactive power, but often assume that inactive power 
is simply 10% of active power. These models are much faster than the more 
detailed models because even less state and computation must be performed for 
each simulated execution cycle - the energy model boils down to a set of “ac- 
tivity counters” for specific functional components of the microarchitecture and 
estimated average power costs. Some of these quantities can be scaled for dif- 
ferent component sizes. For example, a memory unit can be broken into smaller 
components and a “power density” and area can be recorded for each smaller
component; as the size of that structure is varied, the area is used to scale the 
total power cost for accessing that unit. This is particularly useful for memory- 
intensive or array structures within a processor. 
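
A minimal sketch of the activity-counter idea for a single functional unit is shown below. It assumes at most one access per cycle and an inactive cost fixed at 10% of the active cost; the function name and the numbers are hypothetical.

```python
def unit_energy(accesses, cycles, active_power, inactive_fraction=0.10):
    """Energy (power x cycles, arbitrary units) for one functional unit over a run.

    Each accessed cycle is charged the average active power; every other
    cycle is charged a fixed fraction of it (10% here, as is often assumed).
    Assumes at most one access per cycle.
    """
    idle_cycles = cycles - accesses
    return accesses * active_power + idle_cycles * inactive_fraction * active_power

# The activity counter says the integer ALU was used in 600 of 1000 cycles.
alu_energy = unit_energy(accesses=600, cycles=1000, active_power=2.0)
always_busy = unit_energy(accesses=1000, cycles=1000, active_power=2.0)
```

The simulator only has to maintain the access counter; the energy estimate is a post-processing step over that counter and the pre-computed average cost.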

Clearly, the accuracy of these models depends greatly on the accuracy of 
the estimated power models of each functional component and the detail with
which component accesses are modeled. This can limit their applicability - if 
the microarchitectural model can not be or is inaccurately represented using the 
available components, the model produces meaningless results. The accuracy of 
the models is also affected by how and in what situation they are applied. We 
have found that it is necessary to understand the underlying architecture of both 
the simulation model and the power model to exercise the appropriate caution
in applying the power model to modified architectures. 

Both of the models we examine are based on the SimpleScalar toolset [3] 
commonly used to model microarchitectures in educational and some re- 
search environments. The basic microarchitecture simulator in SimpleScalar 
(sim-outorder) models an out-of-order microprocessor with a Register Update 
Unit [13], variable instruction fetch width, configurable memory sizes and vari- 
able branch predictor and fetch processing organization. The simulator does not 
model any specific extant processor, such as the Intel PentiumPro or Compaq 
21264, but can be configured to model processors with different resource con- 
straints. 

We now present an overview of each tool including what is provided, as well 
as information on granularity, access to details, flexibility and validation. 

2.1 Cai-Lim Power Model 

The Cai-Lim power model [5] is an activity sensitive power model built upon the 
SimpleScalar 2.0 out-of-order simulator. It partitions the SimpleScalar architec- 
ture into 17 hardware structures which are further subdivided into a total of 32 
blocks. Each block is then further partitioned into power density and area for 
both active and inactive contributions from dynamic, static, PLA, clock, and
memory sections of the block. Area estimates are based on publicly available 
designs with additional area allocated for clocking, interconnects, and power 
supply. SPICE simulations were used to find the active power density of typical 
designs based on Taiwan Semiconductor Manufacturing Corporation’s 0.25µm
process files. Inactive power densities were estimated as 10% of the active power 
densities. Power density numbers are used as constants in conjunction with the 
activity counters to model power consumption. 
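
The power-density-times-area bookkeeping described above might look roughly like the following sketch. The section names follow the text, but the areas, densities, function names, and the example block are placeholders and are not values or identifiers from the Cai-Lim source code.

```python
INACTIVE_RATIO = 0.10  # inactive density estimated as 10% of active density

def block_power(sections, active):
    """sections maps a section name to (area, active_power_density)."""
    scale = 1.0 if active else INACTIVE_RATIO
    return sum(area * density * scale for area, density in sections.values())

def run_energy(sections, active_cycles, total_cycles):
    """Combine the constant densities with an activity counter for one block."""
    idle_cycles = total_cycles - active_cycles
    return (active_cycles * block_power(sections, True)
            + idle_cycles * block_power(sections, False))

# Hypothetical block; areas and densities are placeholders, not Cai-Lim values.
tag_block = {
    "dynamic": (2.0, 1.5),
    "static": (1.0, 0.5),
    "clock": (0.5, 2.0),
    "memory": (4.0, 1.0),
}
active_p = block_power(tag_block, True)
idle_p = block_power(tag_block, False)
total = run_energy(tag_block, active_cycles=10, total_cycles=100)
```

Because the densities and areas enter only as constants, changing a structure's size in such a model means choosing new area values, which is where the scaling question discussed below arises.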

The Cai-Lim model is designed to provide a “structure that any micropro-
cessor design can use with their own data” [4]. The values for power density and
areas provided with the currently available version of the power model do not
match a specific architecture but are based on common structures available today
such as a 6T SRAM cell in a 0.25µm process.

The Cai-Lim model tracks how a hardware structure is used by breaking it 
down into different types of accesses and then counting each time that type of 
access occurs during a cycle. For example, the level 1 data cache is comprised 
of 3 blocks - logic, cache and taglines - which are used whenever there is
an access, writeback, replacement, or invalidation to the cache. This structural 
breakdown and its associated information provide an opportunity for detailed 
modeling and an ability to track reductions in dynamic activity. 

All values for power densities and areas have been pre-computed and included 
as part of the source code. The lack of direct access to the models used to gen- 
erate these values makes it difficult to scale units appropriately, even within the 
same process and architectural family. For example, the original RUU modeled 
contains 16 entries. If the number of entries is to be changed, the area should be 
scaled appropriately because it makes up a larger portion of the overall design. 
Unfortunately, linearly scaled is not necessarily the same as appropriately scaled.
The lack of directly accessible details on structural scaling factors limits the Cai- 
Lim power model’s ability to be used to directly compute relative contributions 
to power from different blocks. The model is difficult to extend without exam- 
ining the original process files to determine how to incorporate new hardware 
structures. It is also difficult to perform scaling of components based on changes 
in size of the component, not changes in technology. 

Even with this shortcoming, the Cai-Lim model remains quite flexible, but 
of unknown accuracy. In general terms, a well designed architecture definition 
stage model is expected to be accurate to about 25%[5]. The Cai-Lim model 
currently uses the same power density for all blocks which may lead to larger 
inaccuracies than expected. 

The Cai-Lim model has been validated against the SPICE models used to 
generate the elements and power densities. Its ability to model a real architecture 
is unknown. The Cai-Lim model is not intended to provide absolute numbers for 
any design, but can provide relative numbers between designs. However, without 
an analysis of the per component contribution to the overall error in final results, 
it is difficult to assess whether a given change produces meaningful results. 

TEM²P²EST is a new version of the Cai-Lim model [6], but was not available
for evaluation in time. The Cai-Lim model has also been adapted for use in 
SMTSIM[12]. 

2.2 Wattch 

Wattch[2] is a collection of power models. The first is an “all components always 
on” model and the remaining 3 models are activity sensitive with varying degrees 
of conditional clocking enabled. The conditional clocking options range from full 
power for a component consumed during any cycle in which it is accessed and 
zero otherwise to linearly scaled power based on component usage when accessed 
and 10% of base power when not accessed. Wattch is built upon the SimpleScalar 
3.0 out-of-order simulator that has been extended conceptually from a 5 stage 
pipeline to an 8 stage pipeline. The additional stages are inserted only for the 
global power calculation and have no other effect on the system. Wattch devel- 
oped basic components - array structures, content-addressable memories, combi- 
national logic and wires, and clocking. These components are then used to build 
a parameterized model of major hardware structures. It again does not model a 
particular architecture but its use of an H-tree clock distribution and separate 
accounting of clock power resembles an Alpha. Rather than using a power den- 
sity and area based model, capacitances are calculated from wire delays[10,ll] 
and then used to generate a per cycle cost for a structure. These costs may then 
be scaled by usage indicated by activity counters. 
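
The always-on model and the two ends of the conditional clocking range described above can be sketched as three per-cycle power functions. The function names are hypothetical (they are not Wattch identifiers), and interpreting “component usage” as the fraction of ports used in a cycle is an assumption made for this example.

```python
def always_on(base_power, ports_used, ports_total):
    """All components are charged full power every cycle."""
    return base_power

def gated_all_or_nothing(base_power, ports_used, ports_total):
    """Full power in any cycle with an access, zero otherwise."""
    return base_power if ports_used > 0 else 0.0

def gated_linear_with_idle(base_power, ports_used, ports_total, idle_fraction=0.10):
    """Power scaled linearly by usage when accessed, 10% of base power when idle."""
    if ports_used == 0:
        return idle_fraction * base_power
    return base_power * ports_used / ports_total

# A hypothetical 4-port structure with a base power of 10.0 (arbitrary units),
# observed over three cycles with 0, 1 and 4 ports in use.
usage = [0, 1, 4]
profile = [(always_on(10.0, u, 4),
            gated_all_or_nothing(10.0, u, 4),
            gated_linear_with_idle(10.0, u, 4)) for u in usage]
```

Comparing the three columns of `profile` shows how strongly the assumed clocking style changes the power attributed to a lightly used structure.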

Wattch is designed to provide a framework upon which future work can 
be built. Wattch provides an infrastructure for comparisons between different 
power reduction techniques and provides a parameterized model that can quan- 
tify power reductions. It also exposes the underlying details of its power model 
to the users. Wattch claims an accuracy within 10% of layout-level power tools. 
Wattch’s published results indicate an average accuracy of +/-13% when com-
paring relative power against known relative powers for implemented architec- 
tures (Pentium Pro and Alpha 21264). A per component analysis of Wattch’s 
accuracy is not available in currently published works. 

Wattch tracks how a hardware structure is used in a manner similar to that 
employed by the Cai-Lim model but does not currently support the same level 
of granularity. In addition, some of the structures are based on an all or nothing 
power calculation while others, such as branch prediction, use a specified cost 
for up to 2 * someLimitingFactor. Counters for different types of accesses are
employed, but some details are left out. For example, Wattch tracks accesses to 
the level 1 data cache, but does not currently track invalidations, replacements, 
or writebacks. It does take into account logic, array (cache in the Cai-Lim model) 
and tag contributions to the total power of the cache. 

Wattch provides greater access to the underlying details of the models than 
the Cai-Lim model provides. Researchers are provided with information in the 
code about how the power numbers are derived and structural scaling factors 
are calculated by Wattch during an initialization phase of the program. Technology
scaling factors are included for processes ranging from 0.10 µm to 0.80 µm.
Wattch uses this information as well as the SimpleScalar configuration specified 
to initialize power values for different aspects of the hardware structures. For 
example, the power calculations for the instruction window are dependent on 
the process size as well as the RUU size. 

Wattch provides some of the flexibility needed by researchers but has short- 
comings that can limit its usefulness. The lack of granularity in access counting 
limits Wattch’s ability to identify activity reduction power savings. The underly- 
ing hardware structure breakdown provides users with the opportunity to build 
customized components to study specialized hardware, but with unknown accu- 
racy. Wattch also does not provide information about the contributions to total 
error by the various components, leaving researchers with little way to judge the 
validity of designs that involve components of sizes that differ from published 
implementations. A detailed correlation study would reduce this last concern. 



3 Methodology 

We have attempted to place Wattch and the Cai-Lim power model on as similar 
a footing as possible. Wattch consists of 4 power models. We have selected the 
Wattch conditional clocking power model (model cc3 in Wattch’s source code) 
that most closely matches the active and inactive contributor model used by the 
Cai-Lim model. Both models calculate the active circuit power and then set the 
inactive circuit power to 10% of the active circuit power. 

3.1 Experiments 

Our experiments, described below, were performed using a modified SimpleScalar 
3.0 out-of-order simulator that contained both Wattch and the Cai-Lim model.

Table 1. Pipeline parameters for our base pipeline simulations.

Pipeline Simulation Configurations

Parameters             4-wide                            8-wide
---------------------  --------------------------------  --------------------------------
Machine Width          4-wide fetch, 4-wide issue,       8-wide fetch, 8-wide issue,
                       4-wide commit                     8-wide commit
Window Size            64 entry register update unit,    256 entry register update unit,
                       32 entry load/store queue         128 entry load/store queue
Branch Misprediction   min. recovery latency 6 cycles
L1 Icache              64kB, 32 byte lines, 2-way set-associative, 2 cycles hit latency
L1 Dcache              64kB, 32 byte lines, 2-way set-associative, 2 cycles hit latency
L2 Cache Combined      512kB, 64 byte lines, direct mapped, 6 cycles hit latency
Memory                 128 bit wide, 26 cycles access latency
BTB                    1024 entry, 4-way set-associative, 32 entry return address stack
TLB                    64 entry (I), 128 entry (D), fully associative, 30 cycle miss latency
Functional Units and   4 IntALU(1/1),                    8 IntALU(1/1),
Latency (total/issue)  1 IntMult(3/1) / Div(20/19),      2 IntMult(3/1) / Div(20/19),
                       2 Load/Store(2/1), 4 FPAdd(2/1),  4 Load/Store(2/1), 8 FPAdd(2/1),
                       1 FPMult(4/1) / Div(12/12) /      2 FPMult(4/1) / Div(12/12) /
                       Sqrt(24/24)                       Sqrt(24/24)



To ensure the combined models produced independent results, we also per- 
formed a separate data independence experiment to confirm that a simulator 
containing only Wattch or only the Cai-Lim model produced the same results 
as our combined simulator. We found this to be true for data runs in all our 
power experiments. We ran the SPEC95INT benchmarks under our combined 
Wattch/Cai-Lim simulator using the configurations shown in Table 1. 



Low Power Mode for IPC Matching. In prior work, we explored the pos- 
sibility of exploiting naturally occurring variations in program IPC to reduce 
power consumption [8]. We found that our MPEG benchmarks were able to spend 
a significant amount of time in a dynamically pipeline gated fetch mode, in an 
in-order issue mode, or a combination of these two modes depending on the 
characteristics of the time-varying IPC. To continue this work, we undertook a limit
study to explore the effects of running applications under dynamic pipeline gating,
in in-order issue mode, or in the base case for the next smaller 2^n machine
width (half-width) for the duration of the application’s run. A single set of results
consists of a base case (8-wide or 4-wide), a pipeline gated run, an in-order issue
run, and a half-width base case (4-wide or 2-wide). The lower power modes we
investigated involve no explicit conditional clocking on unused sections of circuit 
blocks. The power savings observed stem from reduced activity throughout the 
pipeline and the conditional clocking provided by the power models. 



Dynamic Instruction Queue Resizing. Folegnani et al. introduced a sepa- 
rate low power optimization [7] that extended the design of the current Compaq 
Alpha 21264 architecture. They found that the ready instruction queue was of- 
ten underutilized and that the extent of this underutilization varied over an 
application’s lifetime. A special tag is added to instructions issued indicating if 
they were issued from the last segment of the current instruction queue. When 
no instructions are issued from the last segment for some number of cycles, the 
segment is powered off, reducing the power consumed by the queue as well as 
reducing the power needed for writebacks and selections. We performed similar
simulations using the basic 8-wide and 4-wide configurations from Table 1.
Power reductions observed under dynamic instruction queue resizing are due to
explicitly shutting down a portion of the instruction queue. Power consumption 
may change in other blocks as well because the distribution of the workload is 
changed by applying dynamic instruction queue resizing. 
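
The shrinking rule described above can be sketched as a small controller. This is an illustration under stated assumptions, not the implementation used by Folegnani et al. or in our simulator: the class name, the `idle_limit` threshold, and the segment count are hypothetical, and the sketch covers only the power-off decision, not any policy for re-enabling segments.

```python
class ResizableQueue:
    """Tracks how many instruction queue segments remain powered on."""

    def __init__(self, num_segments, idle_limit):
        if num_segments < 1 or idle_limit < 1:
            raise ValueError("num_segments and idle_limit must be >= 1")
        self.active_segments = num_segments
        self.idle_limit = idle_limit      # hypothetical threshold, in cycles
        self.idle_cycles = 0              # cycles since last-segment issue

    def end_of_cycle(self, issued_from_last_segment):
        """Update state after one cycle and return the active segment count.

        issued_from_last_segment stands in for the special tag carried by
        instructions issued from the last active segment.
        """
        if issued_from_last_segment:
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
            if self.idle_cycles >= self.idle_limit and self.active_segments > 1:
                self.active_segments -= 1   # power off the last segment
                self.idle_cycles = 0
        return self.active_segments
```

A power model can then charge the queue only for `active_segments` worth of entries in each cycle.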

3.2 Evaluation Criteria 

Both models are designed to demonstrate relative power savings between designs. 
We have selected our criteria on this basis. Each application run is normalized 
against the power results of the appropriate base case so that relative power 
reductions can be compared. We examine per cycle power and energy where 
appropriate. Wattch claims an accuracy of within 10% of the results provided by 
layout-level power tools. We use this 10% estimate of error to evaluate whether or 
not the results from the Wattch cc3 model and the Cai-Lim model are statistically
different from each other as well as statistically different from the base case. 
We believe the actual error in the models may be larger, implying that the 
results from additional experiments will fall below the threshold of statistical 
significance. This has implications for comparisons between Wattch and the Cai- 
Lim model within a single data set as well as for comparisons between power 
reduction techniques and a base case without the technique under either model. 

3.3 Statistical Significance 

The analysis of errors under either power model should take into account the 
initial random errors in calculations and then the propagation of these errors 
throughout the remainder of the power model. We have not done this in detail 
for the power models we investigated, but we hope it will be done for power 
models in wide-spread use. 

We use Wattch’s published error estimation as the sample variance and use a 
sample size of one to calculate the 90% confidence level. We are most interested 
in whether a run differs from its base case. If the resulting confidence interval 
includes the base value we cannot say the new result is statistically different 
from the base case result. Applying this technique leads to a range of approximately
+/- 16%. Normally a t-test is applied in situations where the confidence
intervals overlap with each other, but do not include the base case result. A 
t-test is not possible with a sample size of one, so the visual test for signifi- 
cance was applied instead. An example of a visual test for significance is shown 
in Figure 1. Throughout the remainder of the paper, when we discuss statisti- 
cal significance we are using the 90% confidence level to determine significance. 
Additional information on confidence intervals can be found in [1,9,14]. 
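
The range of roughly +/- 16% and the visual test can be reproduced with a short sketch. It assumes that the published 10% error is treated as a standard deviation, that a two-sided normal quantile is used, and that the interval is applied relative to each normalized result; these are our reading of the procedure above, and the function names are hypothetical.

```python
from statistics import NormalDist


def half_width(relative_error=0.10, confidence=0.90):
    """Relative half-width of a two-sided confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 1.645 at 90%
    return z * relative_error


def interval(value, hw):
    """Confidence interval around a normalized result."""
    return (value * (1.0 - hw), value * (1.0 + hw))


def overlaps(a, b):
    """Visual test: True if two intervals overlap (not significant)."""
    return a[0] <= b[1] and b[0] <= a[1]


def differs_from_base(value, hw, base=1.0):
    """True if the interval around value excludes the base value."""
    low, high = interval(value, hw)
    return not (low <= base <= high)
```

Under these assumptions, a normalized result of 0.90 is not distinguishable from a base case of 1.0, while a result of 0.80 is.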

4 Results 

For each experiment, we briefly discuss our expected results. We then present 
our actual results including the measure of statistical significance developed in 
Section 3. Finally, we analyze why the results differ from each other. 

[Figure 1: two panels of confidence intervals, labeled “Not significant” (left) and “Significant” (right).]

Fig. 1. The overlapping confidence intervals on the left fail the visual test for significance.
The non-overlapping intervals on the right indicate a significant result.



4.1 Low Power Mode for IPC Matching 

Figure 2 shows a subset of the results for this experiment. These benchmarks 
were selected because they represent the outliers in the 4-wide case. Other benchmark
results fall between these. All runs shown are normalized against a base
case where no power reduction techniques have been applied. We are primarily 
interested in per cycle power because our intent is to discover appropriate lower 
power modes that can be applied dynamically in response to natural variations 
in the IPC over the lifetime of the program. We also examine the energy to see 
if it is significantly different from the base case and if it would lead us to draw
different conclusions if energy reduction were our primary goal. 

Our expectation was that the three techniques applied would each result in a 
power reduction with in-order issue producing the greatest relative reduction in 
power. It was unclear if energy would also be reduced because of its dependence 
on the number of total cycles executed. Finally we were primarily interested in 
which technique(s) would provide us with the largest relative power reduction. 



Results. Applying the pipeline gating technique for the entire duration of the
benchmark runs indicates a statistically significant reduction in relative per
cycle power under both 4-wide and 8-wide microarchitectures (see Figure 2(a)
and (c)). However, the results for the Cai-Lim model and Wattch are also
significantly different from each other in our 4-wide microarchitecture results.
The Cai-Lim model consistently produces lower relative per cycle power results. 

In-order issue produces a further power reduction under the Cai-Lim power 
model and Wattch for both 4-wide and 8-wide microarchitectures. In all Wattch
results, the reduction is significantly larger than that provided by pipeline gat- 
ing. The results under the Cai-Lim model included both statistically significant 
(Figure 2(a) and (b), lisp-gated and lisp-inorder) and insignificant (Figure 2(a) 
and (b), m88ksim-gated and m88ksim-inorder) values. The Cai-Lim model and 
Wattch are also significantly different from each other. 

Fig. 2. Experiment 1 - Possible IPC Matching Techniques on 4 and 8-wide microarchitectures.
All results are normalized against the appropriate base case. Error estimation
is based upon the 10% margin of error reported by Wattch at the 90% confidence level. 
Error bars for the base cases are indicated by the box centered around 1. 



The results from the half-width case highlight differences between the models. 
Under the Cai-Lim model, this case is indistinguishable from the pipeline gating 
case (Figure 2(a) and (c), m88ksim-gated and m88ksim-halfwidth, and lisp-gated
and lisp-halfwidth). Under the Wattch model, it is instead indistinguishable
from the in-order case (Figure 2(a) and (c), m88ksim-inorder and
m88ksim-halfwidth, and lisp-inorder and lisp-halfwidth). The m88ksim-halfwidth
and lisp-halfwidth results show that the Cai-Lim model and Wattch are also not
statistically different from each other in the half-width case.

The two power models would lead to different choices if they were used to 
select an appropriate lower power mode to exploit periods of low IPC activity. 
The Cai-Lim model indicates in-order issue is the best choice for per cycle relative 
power reduction for both 4-wide and 8-wide designs. Wattch fails to distinguish
between in-order issue and half-width, allowing the designer to arbitrarily choose
between these two techniques. 

We also explore the effect of these techniques on energy. The effects of small
differences are magnified because of the increase in the runtime of the program.
In all cases, the Cai-Lim model indicates an overall reduction in energy. The
Wattch model has more variable results, including a number of cases where the
energy is not statistically different from that of the base case. 

Based on our energy results, we would reach different conclusions depending 
upon which model we used. For example, in Figure 2(b), on a 4-wide machine, the
Cai-Lim model indicates that all three techniques are equally valid and produce 
comparable energy reductions. Under Wattch, we find that the best energy re- 
duction technique is actually that of running at the next smaller machine width. 

We can draw similarly opposing conclusions based upon the 8-wide results 
(see Figure 2(d)). The Cai-Lim model clearly favors the use of in-order issue 
for energy reduction, while Wattch predicts in-order issue will have the worst 
relative energy effects and will, in fact, increase energy. 



Analysis. The Cai-Lim model produces larger power reductions than Wattch in 
all cases except for the half-width 8-wide microarchitecture (a 4-wide machine).
Although not all of these results lie outside the range required for statistical 
significance, they do indicate an overall trend. The Cai-Lim model appears to 
be better able to distinguish differences in dynamic activity than the Wattch 
model. The results are partially a result of the underlying microarchitectural 
model and partially a result of the types of activities recorded. 

For example, the Wattch model determines whether or not any instructions 
have been fetched this cycle by counting the number of icache and branch pre- 
dictor accesses. The resulting icache power is based on the notion of accessed 
or not accessed. This models an architecture where only a single line is fetched 
unless the pipeline stalls. After a line is fetched, as many instructions as possi- 
ble are used from it. However, the out-of-order simulator used in this case does 
not model this architecture so there is a disparity between the power model’s 
architecture and the architectural simulator’s architecture. In addition, Wattch 
provides no mechanism for power consumption accounting during fetch if the 
fetch engine is able to fetch multiple cache lines per pipeline cycle. 

The Cai-Lim model instead distinguishes between accesses to next PC pre- 
diction and writes to the dispatch queue. The number of accesses and writes are 
factored into separate circuit blocks. This model allows multiple fetches during 
a cycle to be accurately accounted for and shows power reductions when the 
pipeline gating technique is applied partially because the writes to the dispatch 
queue have been halved. This more closely models the underlying simulation 
architecture which can make repeated icache searches during fetch and an archi- 
tecture in which the fetch engine runs faster than the rest of the pipeline. 

The accounting and breakdown done by each model also has implications on 
the power consumed by the register file. For example, on a 4-wide architecture, 
the Cai-Lim model indicates that half as much power is consumed by the register 
file under pipeline gating when it is compared to the base case. Wattch indicates 
a much smaller reduction. The Cai-Lim model produces results that indicate 
larger per cycle power and overall energy savings for the load-store queue (LSQ), 
the register file, and the Register Update Unit (RUU) than Wattch does. 

Fig. 3. Experiment 2 - Dynamic Instruction Queue Resizing on 4 and 8-wide architectures.
All results are normalized against the appropriate base case. Error estimation is
based upon the 10% margin of error reported by Wattch at the 90% confidence level. 
Error bars for the base cases are indicated by the box centered around 1. 



Overall, the Cai-Lim model has a finer granularity for determining activity 
within units. For example, Wattch employs a single notion of instruction cache 
(icache) access, but the Cai-Lim model tracks instruction translation lookaside
buffer (itlb) accesses, replacements, invalidations, and writebacks in addition to
instruction tag (itag) and icache contributions that are dependent upon level-1
icache (il1) accesses, replacements, invalidations, and writebacks. In addition,
each unit is broken down into its dynamic, static, clock, memory, and PLA
contributors, allowing the model to be finely tuned by adjusting its power
densities and areas to match a specific target architecture. Unfortunately, this
granularity is not fully
utilized in the current implementation of the Cai-Lim model because the same 
power density is used throughout for each block leading to potentially erroneous 
results. Additionally, the lack of access to the details needed for determining the 
area used by components whose sizes differ from the Cai-Lim base model means 
that these were not adjusted to reflect their new size. 

4.2 Dynamic Instruction Queue Resizing 

Figure 3 shows the results for this experiment. All runs shown are normalized 
against a base case where no power reduction techniques have been applied. We 
are primarily interested in energy because our intent is to discover if dynamic instruction
queue resizing is an appropriate technique for exploiting periods during
an application’s lifetime where newly arrived instructions are not being issued. 
We are interested in whether or not the models are able to reflect the fact that 
the average instruction queue size under dynamic instruction queue resizing is 
often considerably less than its maximum size and how well conditional clocking 
can be applied to this block in the power models. 

We expected both models to show a relative power reduction due to the 
scaling of the instruction queue. We also expected an energy reduction because 
the overall effect on the runtime of the application is less than 10%. We are 
more concerned with energy results in this case because we are interested in the 
overall effect of applying this technique. 



Results. Under a 4-wide architecture, we observed only a single case where 
the energy reduction was statistically significant using our minimum estimate of 
the error in the power models (Figure 3 (a) Cai-Lim compress). Realistically, the 
error is larger and none of the results are statistically different from the baseline. 
The Wattch and Cai-Lim models also are not distinguishable from each other. 
Our results are much more conservative than those of Folegnani et al. because we used
a smaller instruction queue than they did. 

The 8-wide architecture indicates a statistically significant energy reduction 
for all benchmarks except compress when examined with the Cai-Lim model. 
Once again, the Wattch model produced energy reductions that are not statis- 
tically different from the baseline. In this case, the Cai-Lim and Wattch models 
are statistically different from one another for the benchmarks that show the 
largest power reductions (go, m88ksim, and ijpeg). Cai-Lim and Wattch are not
statistically distinguishable for the remaining benchmarks. 

Under the Cai-Lim model we would conclude that dynamic instruction queue 
resizing is a worthwhile technique to pursue on our 8-wide architecture. We would
not draw similar conclusions from either power model on the 4-wide architecture
or from Wattch on the 8-wide architecture.

We have also included the average per cycle power results as an indication
that dynamic instruction queue resizing did not inordinately increase the run time
of the applications. This can be seen by examining the normalized
average per cycle power results and comparing them to the normalized energy 
results. Dynamic instruction queue resizing increases the average runtime of an 
application by a few percent. 
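
The comparison of normalized power and normalized energy rests on a simple identity: if energy is the average per cycle power multiplied by the number of cycles, then the ratio of normalized energy to normalized power equals the ratio of cycle counts. The sketch below assumes a fixed clock frequency; the function name and the sample values in the usage note are hypothetical and are not results from our experiments.

```python
def runtime_ratio(normalized_energy, normalized_power):
    """Cycle count relative to the base case.

    Assumes energy = average per cycle power * cycles, with both inputs
    normalized against the same base case.
    """
    if normalized_power <= 0:
        raise ValueError("normalized_power must be positive")
    return normalized_energy / normalized_power
```

For instance, a normalized power of 0.95 paired with a normalized energy of 0.97 would correspond to a run that takes about 2% more cycles than the base case.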

Analysis. There are a number of factors that contribute to the resulting dif- 
ferences. One difference in our results generated by the two power models can 
primarily be attributed to the different clock representation methods employed 
by each model. The Wattch model keeps track of the clock power as a separate 
component, while the Cai-Lim model calculates the clock power on a per block 
basis using the amount of area of a block devoted to clock power distribution. 
This allows the Cai-Lim model to scale the clock contribution of array structures 
more easily. Wattch precalculates the clock contribution and then simply scales
it for each cycle by comparing the current cycle’s total power to the maximum
total power. In order to accurately examine the effects of array structure
scaling, it would be necessary to either recalculate the clock contribution every 
time an array structure such as the instruction queue is resized or to precompute 
all the possible clock values based on all possible array structure sizes during 
power model initialization. Neither of these solutions was applied in this case. 
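
The per cycle clock scaling and the precomputation alternative discussed above can be sketched as follows. This is an illustration only: the function names are hypothetical, and `clock_model` is a placeholder callable, since no formula for clock power as a function of array size is given here.

```python
def scaled_clock_power(max_clock_power, cycle_power, max_power):
    """Scale a precalculated clock power by this cycle's share of max power."""
    if max_power <= 0:
        raise ValueError("max_power must be positive")
    return max_clock_power * (cycle_power / max_power)


def precompute_clock_table(clock_model, queue_sizes):
    """Precompute one clock power value per possible queue size.

    clock_model is a hypothetical callable mapping a queue size to a
    clock power; it stands in for whatever the power model would use.
    """
    return {size: clock_model(size) for size in queue_sizes}
```

With a table built at initialization, a resize event would only require a lookup keyed on the new queue size rather than a recalculation of the clock contribution.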

Another contributor to the observed differences can be found in the base 
numbers themselves. The RUU includes the reorder buffer behavior. The rename 
registers cannot be gated off because they may still be referenced. The power 
savings then comes from not dispatching new instructions into the “disabled” 
portion of the RUU and not broadcasting values to it. Under the Cai-Lim model, 
these form a larger contribution to the total power than they do under Wattch. 



4.3 Which Model to Use? 

It is unclear which model provides a more accurate picture of the experiments. 
The unknown accuracy of the Cai-Lim model could mean that the results it pro- 
duced are wildly inaccurate. Its use of a single power density for all components 
and lack of accessible structural scaling factors also limit its applicability. The 
less finely grained counters as well as the accounting methods used during fetch 
and decode under Wattch may not account for the types of activity reduction 
the low power modes are expected to produce. The unknown contributions to 
Wattch’s total error that occur under structural scaling suggest that caution
should be used when applying it outside existing processor designs. 

Further analysis should be done before using either power model in situations 
where the target architecture differs significantly from that used by the power 
model. For example, both power models are based upon the use of an RUU. 
The inaccuracies that may be introduced by attempting to model a reservation 
station-based design may blur any possible power savings. 

5 Conclusions and Future Work 

Although architectural level power modeling tools are important as we explore 
new low power optimization techniques, we find that neither of the two freely 
available tools is sufficiently accurate or complete to allow wide-ranging experiments.
We also urge extreme caution when analyzing results produced by
either of the current solutions. Our results indicate that even at a 90% confidence 
level, the power models often fail to produce results that are statistically distin- 
guishable from the base case. We also observed situations in which the models 
produce results which are statistically different from each other for the same 
power reduction technique. We find that the current accuracy of the models will 
need to be improved to detect anything but the largest power savings. 

We have found it is necessary to keep in mind both the underlying architec- 
tural model used in the power model as well as the model of the simulator. In 
situations where the two models differ, such as fetch and decode under Wattch
and SimpleScalar, one or both models may need to be modified to more closely 
match the desired behavior. 

It is hoped that as these models are revised and enhanced these shortcom- 
ings will be overcome, making them more generally useful and accessible to the 
research community. In particular, we would like to see enhanced versions that 
combine the flexibility of Wattch with the granularity of Cai-Lim. In addition, to 
produce results which we can confidently state are significant, it will be necessary 
for new models to more thoroughly analyze the error in individual components 
as well as the overall error. 